DOCUMENT RESUME

ED 118 579                                        TM 005 0»0

AUTHOR          Pasquariella, Bernard G.; Wishik, Samuel M.
TITLE           Evaluating Training Effectiveness and Trainee
                Achievement: Methodology for Measurement of Changes
                in Levels of Cognitive Competence. Manuals for
                Evaluation of Family Planning and Population
                Programs, Number 8.
INSTITUTION     Columbia Univ., New York, N.Y. International Inst.
                for the Study of Human Reproduction.
SPONS AGENCY    Agency for International Development (Dept. of
                State), Washington, D.C.; Ford Foundation, New York,
                N.Y.
REPORT NO       Man-8
PUB DATE        75
NOTE            221p.; Some of the figures in the text and some pages
                in the appendices may reproduce poorly due to small
                print
AVAILABLE FROM  International Institute for the Study of Human
                Reproduction, 78 Haven Avenue, New York, New York
                10032 ($3.00)

EDRS PRICE      MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS     Academic Achievement; Achievement Gains; Achievement
                Tests; *Cognitive Measurement; Comparative Analysis;
                Computer Programs; Data Analysis; *Educational
                Programs; Evaluation Methods; Family Planning;
                *Guidelines; Manuals; *Program Effectiveness; *Program
                Evaluation; Statistical Analysis; Test Construction;
                *Trainees; Training; Training Objectives


ABSTRACT

This Manual has been designed to provide step-by-step guidelines
for conducting an evaluation of a structured training sequence.
The assessment design to be presented involves essentially: the
testing of a group of trainees before and after a sequence of
instruction by administration of the same set of objective-form
items under structured testing conditions; and the application of
a series of statistical procedures to the resultant scores and
individual item responses to determine the magnitude, direction
and level of Test to Retest changes in cognitive (subject matter)
competence. As will be stressed repeatedly throughout the Manual,
the quantitative analysis of the testing data can provide both a
measure of trainee achievement and an assessment of training
effectiveness, by estimating how much of the increase in levels of
subject competence displayed by the trainees at the end of the
course can be attributed to the training experience. (Author)



Documents acquired by ERIC include many informal unpublished materials not available from other sources. ERIC makes every
effort to obtain the best copy available. Nevertheless, items of marginal reproducibility are often encountered and this affects the
quality of the microfiche and hardcopy reproductions ERIC makes available via the ERIC Document Reproduction Service (EDRS).
EDRS is not responsible for the quality of the original document. Reproductions supplied by EDRS are the best that can be made from
the original.






Manuals for Evaluation of
Family Planning and Population Programs

Manual Number 8 of a series



EVALUATING TRAINING
EFFECTIVENESS AND
TRAINEE ACHIEVEMENT

Methodology for Measurement of
Changes in Levels of Cognitive
Competence



Bernard G. Pasquariella
Samuel M. Wishik



DIVISION OF SOCIAL AND ADMINISTRATIVE SCIENCES
INTERNATIONAL INSTITUTE FOR THE STUDY OF HUMAN REPRODUCTION
COLUMBIA UNIVERSITY / 78 Haven Avenue / New York, New York 10032



EVALUATING TRAINING EFFECTIVENESS AND TRAINEE ACHIEVEMENT 

Methodology for Measurement of Changes
in Levels of Cognitive Competence



by



Bernard G. Pasquariella, M.A.*
Samuel M. Wishik, M.D., MPH**



With the assistance of: 
William Wilkinson 



^Research Associate (Psychology\q4 ' ^JoJaJ 

of Social and Administrative Sciences, International
Institute for the Study of Human Reproduction,
Columbia University, 78 Haven Avenue,
New York, N.Y. USA 10032

1975






Published by:

Columbia University / 78 Haven Avenue / New York, New York 10032
Printed in the United States of America



Of Sfpl^f Pr'.'tf' P^^^^l^' i» P^t. by financial s,5,port 
Library of Congress Catalog Card Number: 75-7901




PREFACE 

The assessment procedures that form the basis of this Manual
were originally developed in response to a request from the
Demographic Association of El Salvador for help in evaluating
a series of training programs in Population and Family
Planning, and Maternal/Child Health, each directed at a different
professional or paraprofessional level.

The one objective common to all programs in the series was
that the trainees acquire a body of knowledge (and fundamental
skills) in a number of Public Health-related subject
areas. Thus, the evaluation of instruction was to focus on
an assessment of the amount of substantive learning that
occurred among the trainees in the various subject areas.
The assessment was to be effected by means of a single
objective test instrument administered twice under a
Pre-/Post-Instruction design. Rather than create an achievement
test instrument on the subject matter and send it to them, it was
decided that a better procedure would be to prepare a set of
guidelines for the preparation of the test instrument and
allow the Association to create its own test to meet the
specific program needs. As a result, the idea for a complete
Manual was born.

Field testing of the methodology outlined in the guidelines
was later conducted, at the invitation of the US Agency for
International Development, for a training program in
Washington, D.C. involving a government-sponsored Population/
Family Planning Program Seminar-Workshop. Additional
experience with the methods, leading to some modification of
the Manual, was provided by a request for an evaluation
consultation by the Department of Health and Family Protection
at the National School of Public Health in Rennes, France in
November 1973. (A more complete discussion of the background
of the Manual is provided in Appendix A.)

While parts of the evaluation procedures are newly introduced
here, the steps involved in the instrument design are for the
most part not innovative, but are based on what may be taken
as standard thinking on the subject of achievement testing
and measurement. It was not felt that educational and
psychometric theory need be brought into the body of the
Manual. It is assumed that the reader will accept the
authority of the sources listed in the bibliography. However,
an appendix discussing some of the theoretical aspects of
achievement test design and the Test/Retest model has been
provided.




The Manual has been designed to be a self-contained, complete
guide to objective achievement testing for purposes of
training program evaluation. The text has therefore been
written to take the reader step-by-step through the procedures:
from designing the test instrument (i.e., constructing test
items and creating the test format), through the
administration of the test, coding, and scoring, to the
statistical analyses that provide the final evaluation.
(There are also appendices covering the methodology for the
reader who requires additional information in designing the
achievement instrument or conducting the evaluation.)

NOTE: Information about individual trainees that can be
derived from analysis of the testing data is not recommended
for use in discriminating between trainees on matters such as
job placement, salaries, future promotions, etc. The primary
focus of the analysis of test data relating to individual
trainees is for purposes of evaluating the effectiveness of
a sequence of training, as it is reflected in the performance
of the trainees as a group, and by variations in performance
among subgroups and individuals.






ACKNOWLEDGMENTS 



Appreciation is extended to individual members of the Institute
staff for particular assistance. The authors are greatly
indebted to William Wilkinson for his invaluable contributions
as editor, critic and reviewer throughout the preparation
of this Manual. His creativity and imagination, especially
in the designing of many of the sample achievement
test items provided in the text, are also appreciated. Dr.
David Wolfers, Epidemiologist, is named in the text for several
valuable original formulations. Dr. Prem Talwar, Senior
Statistician, contributed to statistical presentations and
Joanne Revson assisted in the development of the test
instruments employed in the overseas field trials.

Our thanks also to the staffs of the Governmental Affairs
Institute, Washington, D.C. and the Department of Health and
Family Protection, National School of Public Health, Rennes,
France for assistance in testing our methodology in their
training programs.






Those who judge of a work by rule are in regard to others
as those who have a watch are in regard to others. One
says, "It is two hours ago"; the other says, "It is only
three quarters of an hour." I look at my watch and say to
the one, "You are weary," and to the other, "Time gallops
with you"; for it is only an hour and a half ago, and I
laugh at those who tell me that time goes slowly with me
and that I judge by imagination. They don't know that I
judge by my watch. (Pascal, Pensées)






TABLE OF CONTENTS 



Preface

Acknowledgments

Chapter I: General Description of the Methodology

    Training Evaluation Hierarchy
    A Note on Terminology

Chapter II: The Test Instrument

    Designing the Test Instrument
        Overview
        Achievement Areas
        Subject Content Areas
        Two-way Item Specification Table
        Defining Relative Test Emphasis
        Determining Total Number of Test Items

    Objective Items
        Use of Objective Items
        Forms of Objective Items
        General Guidelines for Construction of Objective Items
        Non-Staff Lecturers and the Problem of Adequate Item Coverage

    The Test Format
        General Considerations
        The Test Item Booklet
        The Answer Sheet

    Administering the Instrument
        Extraneous Factors
        Test Instructions
        Guidelines for Administering the Instrument

Chapter III: Coding & Preparation of Item Response
             Data for Statistical Analysis

    Manual/Mechanical Processing
        The Scoring Stencil
        The Score Profile
        Processing for Hand Scoring

    Hand Scoring Procedures
        General Scoring: Item Set and Composite Scores
        Scoring by Trainee Subgroup: Item Set and Composite Scores

    Computer Processing
        Editing
        Coding
        Punch Card Data Format
        Keypunching

    Computer Scoring Procedures
        General Scoring: Item Set and Composite Scores
        Scoring by Trainee Subgroup: Item Set and Composite Scores

    Timing of the Scoring: The Pre-Test

Chapter IV: Utility of the Pre-Test

Chapter V: The Curriculum Audit

    Definition and Purpose
    Concurrent and Retrospective Auditing
    Using the Results

Chapter VI: Utility of the Post-Test

    Situations Requiring Post-Test Revision

Chapter VII: Analysis and Interpretation of Response Data

    General Considerations and Overview

    Analysis of Data
        Applications of Statistical Tests for the
        Significance of Pre- to Post-Test Score Increases
            Stage 1: Subject Area Scores
            Stage 2: Individual Trainee Scores
            Stage 3: Item Response Patterns

    Using the Analysis Results: Evaluating Statistical Test Results
        Criteria of Achievement
        Level and Magnitude of Score Movement

    Summary

    Presentation of Data: The Evaluation Report

Chapter VIII: Assessing the Test Instrument

    Item Analysis
        Procedural Steps
        Interpreting the Item Analysis Data
        Application of Item Analysis

Appendices

References

Bibliography



FIGURES IN THE TEXT



Sample Item Specification Table

Model Scoring Stencil I

Model Scoring Stencil

Score Profile With Test/Retest Scores

Data Card Coding Form

Sample Item Coverage Checklist

Sample Subject Coverage-by-Session Table

Program Flowchart for Computer Analysis

Mean Pre-/Post-Test Scores for Item Sets
and Composite Test

Item Response Patterns by Individual Trainee

Item Response Patterns by Individual Item

Analysis Summary Profile

Achievement/Competence Scores (Unweighted)

Weighted Achievement/Competence Score Curves

Weighted Achievement/Competence Scores

Analysis Summary Profile II

Analysis Summary Profile II (cont'd.)

Sample Card Format with Item Characteristics



CHAPTER I 



GENERAL DESCRIPTION OF THE METHODOLOGY



This Manual has been designed to provide step-by-step
guidelines for conducting an evaluation of a structured training
sequence.

The assessment design to be presented involves essentially:
the testing of a group of trainees before and after a
sequence of instruction by administration of the same set of
objective-form items under structured testing conditions; and
the application of a series of statistical procedures to the
resultant scores and individual item responses to determine
the magnitude, direction and level of Test to Retest changes
in cognitive (subject matter) competence*. As will be
stressed repeatedly throughout the Manual, the quantitative
analysis of the testing data can provide both a measure of
trainee achievement and an assessment of training
effectiveness, by estimating how much of the increase in levels of
subject competence displayed by the trainees at the end of
the course can be attributed to the training experience.

The Manual is in two parts. The first deals with the
construction of the test instrument, its administration, and
methods for ensuring that the material covered in the test
is compatible with what is being planned for the course and
checking that the material which was included in the test
was actually covered in the course.

The second part of the Manual deals with comparative analysis
of the two applications of the test, with each step explained
so that it can be done by either manual/mechanical methods or
by computer.

The first part of the Manual will be the larger and more
detailed of the two. Although the statistical analysis of the
test results is important to the assessment outcome, it is the
initial planning, construction and application stages that must
be given special attention to ensure that the test instrument
will serve its intended purpose.



* See Note on Terminology, p. 6. This concept refers basically
to the acquisition and mastery (in terms of knowledge,
understanding and application) of the subject material of
instruction. (Further discussion of this concept is provided on
pp. 7-8, and in Appendix B.)



Before presenting the practical, how-to aspects of the
methodology, however, it is necessary to provide a small amount of
background and to discuss the uses to which an instrument
such as this can be appropriately put.

In order to decide on an evaluation procedure for any sequence
of training, the objectives of the training must first be
determined. Training program administrators have traditionally
evaluated the reception and impact of their programs by
a number of informal and subjective approaches such as
questionnaires, rating scales, and checklists. All of these
self-report procedures have been used to try to elicit the
following information:

1. In relation to their needs and interests, what the
individual trainees got out of the training.

2. Rating of training sessions, typically on a scale
from poor to excellent.

3. Rating of individual training sessions in terms of
selected aspects.

4. Extent to which trainees felt that the training had
prepared them for future work in the field.

5. Rating of instructors in terms of selected
considerations.

While these approaches to evaluation may provide qualitative
assessments of a program's impact by identifying strengths
and weaknesses as reported by trainees and by indicating
trainees' feelings about the training, they do not usually
supply an administrator with substantive objective feedback
of the type required to assess the degree of effectiveness of
current training and to implement improvements for the future.
The evaluation methodology presented here was developed to
provide objective, quantitative feedback to training
administrators whose aim is to increase the levels of cognitive
competence of trainees in specific subject areas; that is, it
provides an administrator with the means to assess the
effectiveness of the training in increasing the trainees' competence
with the subject matter of instruction through their ability
to adapt and apply what was learned to decision-making and
problem-solving situations.

The following procedural format is employed in assessing the
effectiveness of instruction in terms of increasing levels of
subject matter competence:





1. A pre-instruction baseline level of competence
(assessing the degree to which the trainee has
already acquired what is to be learned) established
by the administration of a series of objective test
items covering the subject material to be presented
during the course of instruction. (THE PRE-TEST)

2. Review of the planned instruction to determine
whether it will adequately meet the needs and demands
of the current trainee group, based on the results of
the Pre-Test.

3. A Curriculum Audit undertaken during the course of
the training, to ascertain how much of the subject
material assessed by the test items is in fact
covered during instruction.

4. A second administration of the same set of
objective-form test items at the end of the training sequence.
(THE POST-TEST)

5. A comparative statistical analysis of Pre-Test and
Post-Test results to assess the effects of training
on levels of subject competence.
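Steps 1, 4 and 5 above amount to a matched before/after comparison of the same trainees. The following is a minimal sketch only: the scores are hypothetical, and the paired t statistic is assumed here as one common way to test a Pre- to Post-Test increase. The statistical procedures the Manual itself applies are presented in Chapter VII and may differ.

```python
import math
import statistics


def paired_gain_summary(pre_scores, post_scores):
    """Summarise Pre-Test to Post-Test change for the same trainees.

    Both lists must be in the same trainee order. The function name and
    the returned keys are illustrative, not taken from the Manual.
    """
    if len(pre_scores) != len(post_scores) or len(pre_scores) < 2:
        raise ValueError("need matched scores for at least two trainees")

    # Gain for each trainee: Post-Test score minus Pre-Test score.
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    mean_gain = statistics.mean(gains)
    sd_gain = statistics.stdev(gains)  # sample standard deviation of gains

    # Paired t statistic: mean gain divided by its standard error.
    # Left as None when every trainee gained exactly the same amount.
    t_statistic = None
    if sd_gain > 0:
        t_statistic = mean_gain / (sd_gain / math.sqrt(len(gains)))

    return {
        "n": len(gains),
        "mean_pre": statistics.mean(pre_scores),
        "mean_post": statistics.mean(post_scores),
        "mean_gain": mean_gain,                           # magnitude
        "improved": sum(1 for g in gains if g > 0),       # direction
        "declined": sum(1 for g in gains if g < 0),
        "t_statistic": t_statistic,
    }


# Hypothetical composite scores for five trainees.
summary = paired_gain_summary([12, 15, 9, 20, 14], [18, 19, 15, 22, 20])
print(summary)
```

The t statistic would then be compared against a t table with n - 1 degrees of freedom; that lookup is left out to keep the sketch free of external libraries.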

> * 

The administration of the test instrument provides data which
can be broken down into:

1. Data on the training

   a. Total test results

   b. Results on subsets of items

   c. Results on individual items

2. Data on trainee test performance

   a. Total trainee group

   b. Trainee subgroups

   c. Individual trainees
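The breakdown above can be sketched as a set of aggregations over one table of scored responses. This is an illustration under stated assumptions, not the Manual's scoring procedure: the item identifiers, item set names, trainee codes and subgroup labels are all hypothetical, and each response is assumed to be already scored as 1 (correct) or 0 (incorrect).

```python
from collections import defaultdict

# Hypothetical mapping of each item to its item set (subject area).
ITEM_SETS = {
    "q1": "demography", "q2": "demography",
    "q3": "contraception", "q4": "contraception",
}

# Hypothetical scored responses: 1 = correct, 0 = incorrect.
RESPONSES = {
    "T01": {"q1": 1, "q2": 0, "q3": 1, "q4": 1},
    "T02": {"q1": 1, "q2": 1, "q3": 0, "q4": 1},
    "T03": {"q1": 0, "q2": 0, "q3": 1, "q4": 0},
}

# Hypothetical trainee subgroups.
SUBGROUPS = {"T01": "nurses", "T02": "nurses", "T03": "physicians"}


def breakdown(responses, item_sets, subgroups):
    """Split one test administration into the six views listed above."""
    n = len(responses)

    # 2c. Individual trainees: composite (total test) score.
    composite = {t: sum(r.values()) for t, r in responses.items()}

    # Item set scores for each trainee.
    item_set_scores = {}
    for trainee, answers in responses.items():
        per_set = defaultdict(int)
        for item, correct in answers.items():
            per_set[item_sets[item]] += correct
        item_set_scores[trainee] = dict(per_set)

    # 1b. Results on subsets of items: mean score per item set.
    set_names = sorted(set(item_sets.values()))
    item_set_means = {
        s: sum(scores[s] for scores in item_set_scores.values()) / n
        for s in set_names
    }

    # 1c. Results on individual items: proportion answering correctly.
    item_proportion_correct = {
        item: sum(r[item] for r in responses.values()) / n
        for item in item_sets
    }

    # 2b. Trainee subgroups: mean composite score per subgroup.
    by_group = defaultdict(list)
    for trainee, score in composite.items():
        by_group[subgroups[trainee]].append(score)
    subgroup_means = {g: sum(s) / len(s) for g, s in by_group.items()}

    return {
        "composite": composite,
        "item_set_scores": item_set_scores,
        "item_set_means": item_set_means,
        "item_proportion_correct": item_proportion_correct,
        "subgroup_means": subgroup_means,
        # 1a / 2a. Total test, total group: mean composite score.
        "group_mean": sum(composite.values()) / n,
    }


result = breakdown(RESPONSES, ITEM_SETS, SUBGROUPS)
print(result["composite"])
print(result["subgroup_means"])
```

Running the same function on the Pre-Test and Post-Test response tables gives matching structures that can be compared view by view.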



Training Evaluation Hierarchy

It should be emphasized here that the assessment of the impact
of training on subject matter competence is only the first
level of training evaluation. The ultimate objective is to
determine how effective the training has been in increasing
the on-the-job capabilities of trainees. Between this
ultimate level and the more immediate level (assessing competence)
are several intermediate levels of evaluation. These levels,
when listed in terms of time sequence (realization of
objectives at increasingly longer range) and measurement complexity,
form a training evaluation hierarchy, beginning with the level
dealt with in this Manual:

1. Cognitive Competence: How much learning (in terms of
ability to use and apply relevant subject matter) can
be said to have occurred as a result of training
instruction? Measured by objective-type tests
administered to trainees before and after training.

2. Relevant Attitude Change: How has training modified
the attitudes of the trainees about the subject matter
or about their jobs? Measured through structured
attitude scales, projective tests, or other special
test methods (e.g., the Semantic Differential
Technique).

3. Short or Long-Term Retention of New Learning: How
much knowledge and understanding of the subject
materials do trainees retain after selected periods of
time? Measured by delayed re-administration of the
original test or administration of a comparable
instrument.

4. Subsequent Job Placement: To what extent is the job
situation of trainees relevant to the nature of the
training program? Measured by structured follow-up
interview or questionnaire.

5. Assigned Job Duties and Functions: Is the training
content relevant to duties subsumed under the
trainee's work role? Measured by structured follow-up
interview, questionnaire, or observation.

6. Work Performance: To what extent does the worker
employ or fail to employ knowledge and skills
acquired during training? Measured by structured
follow-up observation.

7. Staff/Client Relationships: What effect has
training had on workers' subsequent interaction with
those around them (i.e., co-workers, clients, patients)
in the working environment? Measured by structured
questionnaire or interview with trainees and others,
or by on-site observation.

8. Overall Job Effectiveness: What significant
contributions can be attributed to training in terms
of increased capacity to meet job demands or to
attain goals established by the work role? Measured
by impact on achievement of work objectives.

This Manual is one of several proposed to describe the
development, application and interpretation of instruments,
techniques and designs for assessing the effectiveness of
structured training programs at each of these levels. It
has been designed to be a self-contained reference source,
including the basic information needed to plan, administer
and analyze an objective test instrument under a Test/Retest
design, and gives additional information for specific
situations.

Although primarily written for administrators of training
programs concerning population, family planning, and Maternal/
Child Health, the Manual may be used in a variety of
training situations by individuals whose knowledge of and
experience with educational assessment methods may vary. Its
design is thus intended to be specific enough to permit
unambiguous application of the methodology in specific
training programs and flexible enough to be applicable to a
variety of settings. No attempt has been made to create
an actual test instrument that can be lifted directly from
the Manual. Items written into a test instrument will
depend on the subject material specific to a particular
sequence of instruction and must be designed by those who
actually conduct the training.




A Note on Terminology 



1. Cognitive Competence = Subject (matter) Competence:
a learning outcome of a structured educational
experience involving the acquisition and mastery (in
terms of substantive knowledge, understanding and
application) of the subject material imparted during
a sequence of instruction. For purposes of
assessment, the operational definition of competence
emphasizes usage, adaptation and application of the
material learned rather than simple recognition or
demand recall of the material at a later time.

2. Instructional sequence = educational input =
sequence of training: the systematic imparting,
through structured lectures, seminars and/or
recitations, of subject material of a highly specific
nature. Implicit in the definition is the fact that
such learning experiences are directed toward
predetermined educational objectives.

3. Test = Pre-Test: the administration of an
achievement-measuring instrument at the beginning of an
instructional sequence.

4. Retest = Post-Test: the administration of an
achievement-measuring instrument at the termination of an
instructional sequence.

5. Items = Test Tasks: individual test questions or
problems.

6. Item Set = Subset = Subtest: the subdivisions or
grouping of items, each subdivision corresponding
to a separate subject matter area.

7. Composite Test = Total Test: the total number of
items comprising the complete instrument; the sum
of the Item Sets.






CHAPTER II 
THE TEST INSTRUMENT 



Designing the Test Instrument: Overview

A valid assessment of educational achievement is the result
of a systematically controlled succession of steps beginning
with the identification of relevant objectives, continuing
through construction and administration of the assessment
instrument, and ending with scoring, analysis, and
interpretation of results.

A major purpose of the training process is imparting
substantive subject matter to trainees. It is safe to assume that
the training instructors desire that the trainees acquire full
comprehension of the scope, applications and limitations of
the more significant subject matter. In order to assess the
extent to which this general learning outcome is achieved, it
is first necessary to translate it into components, which in
turn will be translated into performance variables that can be
observed and subjected to objective, quantitative measurement.

The content of most subject areas covered in training courses
consists of methodology, facts, theories, problems, and points
of view. In most training programs the emphasis for the
trainee is on developing competence in subject content usage
and application rather than on content recognition and recall.
This is because what the trainee is able to do with the subject
material will contribute more toward his subsequent
"on-the-job" effectiveness than will simply being able to remember it
on demand. Thus for purposes of assessing trainee achievement
and training impact, subject matter competence is defined as
the trainee's expected ability to perform specific operations
on, and make specific application of, the subject material
that was encountered during a sequence of training instruction.

There are a number of ways to interpret operations and
applications in terms of expected trainee behavior. One
common approach, for example, is to classify them in terms of
the cognitive functions that contribute to those behaviors:
mental processes such as concept formation, inference, analysis,
synthesis, abstract reasoning, critical thinking, etc.
However, the types of behavioral processes involved in this
classification are too numerous and functionally interdependent
and do not readily translate into well-defined test tasks.

As an alternative approach, some type of classification of
behavioral learning outcomes should comprise the domain of subject
matter competence. The desired learning outcomes would then
be defined in terms of overt performance on specified test
tasks. This will provide a most effective approach to the
measurement of subject competence since most of the behavioral
correlates of achievement can be classified into one or another
of several categories. On this basis, a test item would be
classified, for example, as one that contributes to the
conclusion that the examinee "knows terminology and vocabulary,"
"knows concepts and principles," "can apply generalizations and
principles to new situations," "can make valid evaluative
judgments," etc. (1) Test construction guided by a classification
such as this will direct the focus away from simple knowledge
of definitions and facts to encompass a greater range of more
complex cognitive behaviors.
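One way to see whether a draft test has in fact moved beyond definitions and facts is to tag each item with its category and tally the shares. The sketch below is illustrative only: the category labels are the four quoted in the paragraph above, while the item identifiers and their assignment to categories are hypothetical.

```python
from collections import Counter

# Hypothetical draft test: item id -> behavioral category it is meant to assess.
ITEM_CATEGORIES = {
    "q1": "knows terminology and vocabulary",
    "q2": "knows terminology and vocabulary",
    "q3": "knows concepts and principles",
    "q4": "can apply generalizations and principles to new situations",
    "q5": "can make valid evaluative judgments",
}


def category_shares(item_categories):
    """Return the fraction of test items falling in each category."""
    counts = Counter(item_categories.values())
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


shares = category_shares(ITEM_CATEGORIES)
for category, share in sorted(shares.items()):
    print(f"{share:5.0%}  {category}")
```

A draft in which one simple category dominates the tally would suggest rewriting or replacing some items before the Pre-Test is administered.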



Achievement Areas 

The test instrument should be designed to assess a broad range
of behavioral learning outcomes, from simple acquisition and
comprehension of terminology, facts, and principles to
higher-level abilities involving the application of what was learned
to new problem-solving and decision-making situations. The
seven achievement areas specified by Ebel (2) can serve as the
basis for designing test items to effect this type of
assessment.

The seven achievement areas and the types of items that can be
designed to assess them follow:*

1. Understanding of Terminology: Items designate terms to
be defined or otherwise identified. The examinee is provided
with a word or words and asked to select the correct or best
definition from among several alternatives (e.g., "What is an
ectopic pregnancy?" or "The demographic transition is a term
that describes . . ."). These are probably the simplest types
of objective items to design.

2. Comprehension of Fact or Principle: Items are based on
descriptive statements of the way things are. The examinee is
asked to select from several alternatives the response that
best completes a statement, that best answers a question, or
that otherwise shows a grasp of the basic facts and principles
of the subject matter at hand (e.g., "The interrelationship
between ovary and pituitary during the menstrual cycle can
accurately be described as one in which . . ." or "What is the
basic principle underlying the rhythm method of conception
control?").



* See pp. - and Appendix C for information on item
construction.

3. Ability to Calculate: Items require use of mathematical
processes to get from the given to the required quantities.
The examinee is provided with a well-defined computational
problem together with a set of alternative answers. One example
of the type of quantitative item employed in the area of family
planning and population is, "Out of 200 clients initially
enrolled in a family planning program, only 158 remained active
one year later. What is the annual dropout rate?"

4. Ability to Explain or Illustrate: Items generally
contain the words "why" or "because." This type of item has two
forms. The examinee is either asked to select, from the
alternatives given, the one that best explains or provides the best
reason for the existence or occurrence of the specific
situation cited in the item stem (e.g., "If estrogen alone and
progesterone alone can successfully prevent ovulation, why is it
necessary to administer both under the combined method of oral
contraception?") or the examinee is asked to select the
alternative that provides the correct or best answer to the question
posed in the item stem and, at the same time, to justify the
answer selected, as in the following example:

When a spermicidal preparation is the contraceptive
method employed, should douching be postponed for
at least six hours following coitus?

a. Yes, because douching within a few hours following
coitus may either remove the spermicide or dilute
it to the point that sperm will survive in the
vagina.

b. Yes, because irrigation of the vagina within a few
hours following coitus will force large numbers
of live sperm into the fallopian tubes, increasing
the risk of conception.

c. No, because the douching agent will increase the
effectiveness of the spermicidal barrier within a
few hours after coitus, ensuring greater
contraceptive protection.

d. No, because spermicides lose their effectiveness
within three hours following coitus, allowing live
sperm to remain in the vagina unless removed
immediately by douching.

5. Ability to Predict: Items are based on descriptions of
specific situations. All conditions are given and the examinee
is asked for the future result, i.e., to select from among
several alternatives the most likely outcome, as in the
following example:






If the Crude Birth Rate of Hong Kong were reduced
immediately to the level of the Crude Death Rate
(i.e., 4/1000) and held at that rate indefinitely,
assuming no net migration, what would be expected
to happen to the population?

a. The population would cease to change, remaining
steady at the current level with zero growth.

b. The growth rate would decline, but the popula- 
tion would continue to grow more and more slowly 
for several decades. 

c. Population numbers would decline at an accelerating
pace until the population virtually disappeared.

d. The growth rate would commence to oscillate
between positive and negative.

6. Ability to Recommend Appropriate Action: Items are
based on descriptions of specific situations. Some conditions
are given and the trainee is asked to provide, by selecting
from among several alternatives, other conditions or actions
that will lead to a specified result. For example:

Since the initial insertion of an intrauterine
device can seriously damage a developing embryo
(from an undetected pregnancy), the safest time to
insert an IUD is

a. just before the expected menstrual cycle. 

b. during and immediately after menstruation. 

c. only during menstruation. 

d. during the time at which ovulation is expected
to occur.

7. Ability to Make an Evaluative Judgment: The types of
items assessing this level of subject competence involve
response options which are statements whose appropriateness or
quality is judged on the basis of specific criteria presented
in the item stem. For example:

Which of the following ratios provides the best
indication of the overall mortality conditions
in a developing country?

a. The number of infant deaths in a year per 1000
live births in that year.

b. The number of deaths per 1000 in one year over
the total population at mid-year.






c. Deaths to persons over 50 years of age in a
year over the total number of deaths in that
year.

d. Number of deaths in a year to persons 70-74 
years of age per total number of persons aged 
70-74 years at mid-year. 

It should be noted that the acquisition of higher-level abilities
(areas 4-7) depends on achievement in the first three areas. That
is, it is necessary for the trainee to acquire a certain fund of
information (facts, principles, computational skills, etc.) with a
high degree of comprehension before he can adapt and apply this
new learning to practical situations. Therefore, items designed to
assess higher-order abilities will presuppose the trainee's
achievement at the lower levels. The test instrument should
contain items assessing the first three areas as well, however. In
the later analysis of test data it may be discovered that some of
the items assessing high-level abilities were missed because the
trainees did not achieve at the lower levels. For example, trainees
may have done poorly on an item designed to assess their ability
to predict because they did not understand the basic principles
required, or were unable to make an essential calculation.

It is strongly recommended that when designing a test blueprint
all the behaviors that will apply to the specific subject material
be included. Not all of them, however, will be applicable to every
course of training. The relative importance of each of these
behavioral outcomes as objectives of instruction will vary from
program to program. For example, ability to calculate is an
appropriate learning outcome to be expected from a statistical
training course, but would not be a relevant outcome in a training
program where the focus was on subject areas such as contraceptive
technology or the anatomy and physiology of human reproduction.
The final decision as to which of the seven behavioral outcomes
above constitute relevant course objectives will have to be made
by the training staff. The decision will be a subjective judgment
based on an analysis of the specific subject matter to be covered
during the sequence of training, as outlined in the curriculum
plan. However, since the use and application of learned material
constitutes the primary domain of subject competence, it is
recommended that test items assessing abilities should have
greater representation on a test relative to items sampling simple
understanding of terminology, fact, principle, etc.

One final comment about constructing items assessing the seven
achievement areas: It may not always be possible to write pure
items -- i.e., items assessing only one of the areas to the
exclusion of the other six. It may sometimes be the case that an
item will call into play a number of separate abilities, each of
equal importance. This is quite valid in terms of the instrument
being proposed here. These seven achievement areas should not be
considered mutually exclusive categories. It is quite acceptable
to write an item assessing one or more areas, as long as all areas
are given representative coverage. The major purpose of the above
discussion was to illustrate that an objective achievement test
need not be confined to simple recognition/recall tasks, but can
be designed to assess more sophisticated, higher-order cognitive
processes, the types of higher-level processes considered to
underlie subject matter competence.



Subject Content Areas 

Once the types of test behaviors that the examinees are required
to demonstrate in an assessment of cognitive competence have been
specified, the subject content areas to be covered by the test
items should be determined. The content dimension is very
important to the proposed assessment since it is through the
course content that the behavioral outcomes are taught and
through which they are demonstrated. As will be discussed later,
the subject content to be assessed by the test instrument can be
derived from the curriculum plan. Like the classification of
behavioral learning outcomes, the course content should be
arranged (for testing purposes) as a detailed outline of a
limited, finite number of discrete subject matter categories.
This can be done by taking each proposed training session in the
order defined by the curriculum plan and listing the major topics
and subtopics to be covered. When this has been completed for all
sessions, the test designer will have a complete listing of all
the subject matter being proposed for presentation. How to employ
this list, together with the classified behavioral dimension, in
the construction of specific test items is the subject of the
next few pages.



Two-way Item Specification Table 

The behavioral learning outcomes and the content of instruction
represent the two dimensions which underlie the test plan. Once
each has been specified, as shown above, they should be combined
into a framework which will serve as a guide to the development
of the test instrument. This framework, or Item Specification
Table, will serve as the test blueprint.

This blueprint, while a practical guide to test construction, is
also a theoretical outline of what constitutes competence with
the material to be covered during instruction. That is, it
specifies which behaviors an examinee must demonstrate in which
specific subject areas in order for him to be considered as
having attained a high level of subject matter competence.
Properly constructed, the blueprint will illustrate not only
which subject areas are to be covered by the test items, but also
which of the various learned behaviors are to be expected from
each area, and will indicate the relative weights assigned to
each subject area and learning outcome, in terms of the number of
items to be constructed.

The Item Specification Table is a two-way table that relates
specific subject content to expected learning outcomes. A table
of this type is easily constructed for any sequence of
instruction by designing a two-way grid with the subject content
areas listed along the vertical axis (left side) and the
behavioral learning outcomes listed along the horizontal (top).
Table cell entries will consist of check marks or some other code
designating the number of items to be constructed. An example of
the general format for an Item Specification Table is illustrated
in Figure 1. (The numbers, in parentheses, in each of the table
cells indicate the percentage of total items to be devoted to
each behavioral outcome within each content area.) Although it
was designed for a statistics training sequence, the basic format
is applicable to any type of training.
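
The two-way grid described above can be sketched as a small data structure. The following is only an illustration under stated assumptions: the short labels for outcomes 1 through 3 are shorthand introduced here, the function names are hypothetical, and the cell weights are invented rather than taken from Figure 1.

```python
# Illustrative only: labels for outcomes 1-3 and all cell weights are assumed.
OUTCOMES = [
    "1 terminology", "2 facts and principles", "3 calculate",
    "4 explain", "5 predict", "6 recommend action", "7 evaluate",
]


def build_spec_table(content_areas, outcomes):
    """Return an empty two-way grid: one row per content area, one column per outcome."""
    return {area: {outcome: 0 for outcome in outcomes} for area in content_areas}


def row_totals(table):
    """Percentage of all items devoted to each content area (row sums)."""
    return {area: sum(cells.values()) for area, cells in table.items()}


def column_totals(table):
    """Percentage of all items devoted to each behavioral outcome (column sums)."""
    totals = {}
    for cells in table.values():
        for outcome, weight in cells.items():
            totals[outcome] = totals.get(outcome, 0) + weight
    return totals


table = build_spec_table(
    ["Frequency Distributions", "Measures of Central Tendency"], OUTCOMES
)
# Hypothetical percentage weights entered by the training staff.
table["Frequency Distributions"]["3 calculate"] = 2
table["Frequency Distributions"]["4 explain"] = 3
table["Measures of Central Tendency"]["3 calculate"] = 1
table["Measures of Central Tendency"]["5 predict"] = 4

print(row_totals(table))
print(column_totals(table)["3 calculate"])
```

Row sums give the share of the test devoted to each content area, and column sums give the share devoted to each behavioral outcome, which makes it easy to see whether the simpler outcomes are crowding out the more complex ones.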



Figure 1

SAMPLE ITEM SPECIFICATION TABLE FOR THE STATISTICS COMPONENT
OF A TRAINING PROGRAM IN QUANTITATIVE RESEARCH METHODS

[Table not reproduced. Content areas listed on the vertical axis:
Measurement & Scales; Frequency Distributions (ungrouped &
grouped data); Measures of Central Tendency: Mean, Median & Mode;
Measures of Dispersion: Range, Semi-Interquart. Range, Std. Dev.
& Var.; Probability Sampling & Sampling Distributions. Cell
entries are percentage weights in parentheses.]

The importance of employing such a table as a blueprint for
constructing the test instrument becomes evident when the concept
of achievement testing is considered. Any achievement test is a
work sample. That is, the aggregate of items that comprises a
test covering specific subject material is only a sample drawn
from some hypothetical universe or population of all possible
items that might be used to make up such a test. In the
assessment of some curriculum areas, the population of potential
items is limited -- for example, an elementary school class whose
spelling competence with five hundred words is being assessed.
However, for some test situations there is almost no limit to the
number of potential test items that could be constructed -- for
example, the number of qualitative and quantitative items that
could be constructed to cover a sequence of training in
statistics. In most training courses of the type for which this
Manual has been designed, the latter case will probably be more
common.

Where a test constructor has no finite, discrete list from which
to select the item sample to be used, he is faced with the
problem of constructing an aggregate of items that will be an
adequate representation of the total universe of items that would
be appropriate for both the subject matter and the behavioral
learning outcomes.

The purpose of the test instrument is to provide objective data
for making inferences about the extent to which a sequence of
training increases the levels of competence of trainees --
demonstrated by certain cognitive behaviors -- in specific
subject areas. Such inferences will be valid only to the extent
to which the test instrument provides a representative sampling
of potential items reflecting the entire domain of subject
material covered during instruction and all of the expected
learning outcomes. Without a carefully developed test plan, ease
of construction all too frequently becomes the dominant criterion
in selecting and constructing test items. That is, items
measuring simple knowledge (essentially recall and recognition
tasks), because they are easier to construct than those assessing
the more complex learning outcomes, might be over-represented on
the test. As a result, the test might end up assessing a limited
and biased sample of behaviors and subject content areas,
neglecting those that might be more relevant to the objective
under consideration, namely subject matter competence.

The Item Specification Table will facilitate the process of
planning both the types and numbers of items to ensure that an
appropriate, fair, and representative sample is written into the
test instrument.

In constructing the Table, the seven behavioral learning outcomes
are listed regardless of the specific instruction content.
However, because not all of the subject matter covered during
instruction will be important enough to be assessed, some
procedures must be employed by the training staff for selecting
the subject matter that will be entered into the table and thus
sampled by the test items. In many training situations, some
subject material serves as a background or prerequisite for other
more important material. This initial material is important only
to the extent to which it facilitates acquisition and
understanding of the more relevant subject material and would not
be the focus of direct assessment. For example, in a course on
statistics, it would be necessary to introduce the students to
the concept of probability before proceeding to such topics as
hypothesis testing, inference and sampling techniques. While the
student must have an understanding of the laws of probability, it
is their application to the area of statistical inference that is
important. The focus of assessment would therefore be on
statistical inference directly, with little attention given to
probability in itself.

In situations where the subject matter universe to be sampled for
testing purposes is not composed of all the material covered
during instruction, the training staff must select from the list
of content areas (previously drawn from the curriculum plan)
those areas which should be included in the assessment of
competence. Then, each staff member would be responsible for
selecting the most important topics and subtopics from his area
of expertise. Each staff member should base his judgment on the
criterion of relevance to competence -- which material, out of
the entire range of material comprising the particular subject
area covered, should the trainees be capable of dealing with (in
terms of use and application) in order to be considered competent
in that subject area?



Once the relevant material has been selected from all of the
subject areas under assessment, the resulting list will be
entered into the Item Specification Table as the content
dimension. The completed Table will then be employed in
specifying both the types and numbers of items to be included in
a specific test instrument to ensure a balanced assessment
coverage in terms of content areas and behavioral outcomes.


Defining Relative Test Emphasis 

Prior to the construction of test items it is necessary to
determine what proportion of the total items on the proposed test
should be constructed for each subject content area and, within
each content area, for each of the behavioral learning outcomes.






There are no hard and fast rules for allocating a test item 
(or items) to a specific subject content area or to a spe- 
cific behavioral outcome within a subject area. 

One approach that is commonly used, and is suggested by a number
of testing specialists, is to allocate certain numbers of items
to subject areas on the basis of the amount of time devoted to
each during the course of training. Under this system, for
example, if the curriculum plan for a ten week training program
specifies that subject areas A and B are to each receive 4 weeks
coverage, with only 2 weeks devoted to topic C, then the
proportion of test items to be allocated to topics A, B and C
would be 40%, 40% and 20%, respectively. Or, if the plan
delegates three training sessions to coverage of subject X and
only one session to subject Y, the number of test items dealing
with subject X will be three times the number dealing with
subject Y. Similarly, the number of items designed to assess the
different behavioral learning outcomes within the various content
areas would be based on the emphasis placed on each during the
training. With the training focus on use and application of the
subject material presented, the assessment instrument should
emphasize the same behavioral objectives (i.e., behavioral
outcomes three through seven, pp. 8-10). The training staff,
employing the Item Specification Table, would see to it that the
relevant behavioral objectives and subject areas are adequately
sampled by the test items in proportion to the emphasis given
each during the course of instruction.
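
The time-based rule described above is simple proportional arithmetic. A minimal sketch, assuming instruction time is recorded in whole weeks per subject area (the function name is hypothetical):

```python
from fractions import Fraction


def time_based_weights(weeks_by_area):
    """Percentage of test items per area, proportional to instruction time."""
    total_weeks = sum(weeks_by_area.values())
    return {
        area: Fraction(weeks, total_weeks) * 100
        for area, weeks in weeks_by_area.items()
    }


# The ten week program from the text: A and B get 4 weeks each, C gets 2.
weights = time_based_weights({"A": 4, "B": 4, "C": 2})
for area, percent in weights.items():
    print(f"area {area}: {float(percent):.0f}% of items")
```

Exact fractions are used so that shares such as one third are not rounded before they are converted into item counts.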



This approach is appropriate where only the most important
subject matter within each area is included in the population
from which the sample of test items is to be drawn (i.e., the
areas denoted in the Item Specification Table). That is, when
prerequisite or other preparatory material is not being
considered for assessment, despite the fact that training time
had been devoted to it, only the important content areas will
appear on the vertical axis of the Table. The staff may then use
the time-based approach to determine the number of items to be
devoted to each of those areas.

However, a major difficulty arises with a time-based approach
where some subject material is simply more difficult to get
across to a training group than other material. In such a
situation, the subject areas that happen to be more difficult to
present may not be more important or more necessary to the
training objective under assessment (i.e., subject matter
competence), but more time will have to be devoted to their
coverage -- thus qualifying them for a larger proportion of test
items than others that may be equally important. As this is a
fairly common case in training courses, a time-based approach to
the allocation of items to content areas does not guarantee that
the test item content will be a representative sample of the
subject content actually covered during instruction. This
imbalance may well be further reflected in the Curriculum Audit
(see Chap. V). At the end of the training sequence, it may be
noted that much more time was devoted to some subject topics than
others, although this did not reflect the importance of the
topics, as determined in the curriculum plan. Where difficulty of
certain content areas has been controlled for in the curriculum
plan by allocating more time to coverage of these areas, this
should not automatically be reflected in the construction of the
testing instrument, as would happen if the amount of time alone
is the criterion for determining the number of items. It is not
the difficulty of the subject material that determines the number
of items necessary to properly assess how competent trainees have
become with it, but its importance in terms of the training
objective.



An alternative approach would be to have the training staff
determine the relative importance of (and thus, the relative test
emphasis to be given) each behavioral outcome and each topic
within the major subject matter areas to be covered during
instruction. Since the training staff is responsible for both
defining the curriculum plan and instructing within selected
subject areas, the staff members should then be the group most
qualified for setting the standards of competence in their
respective subject areas. (When the training staff includes
outside instructors, as discussed on pp. 32-34, those instructors
should also share the responsibility in determining the relative
importance, for testing purposes, of the behavioral outcomes and
subject topics under consideration.)

The allocation of test items by the training staff will be based
on subjective judgments, since, in most cases, there will not be
objective criteria for determining what constitutes competence in
a specific subject area. Such criteria will usually be set by the
staff members, drawing upon their own expertise in a particular
subject area. The guiding principle underlying the allocation of
test items will be that the test items, in number and content,
should maintain the same relative coverage of important subject
areas and behaviors that the training staff will try to achieve
through instruction. Estimates of relative importance will be
expressed as percentage weights to be recorded in the Item
Specification Table.

In the example (see p. 17) where three subject areas are covered
during a ten week training program, a time-based procedure for
weighting these three areas would result in a 40%, 40%, 20%
allocation of test items (given that the instruction time devoted
to the areas was 4, 4, and 2 weeks, respectively). Assume that a
total of 4 weeks (out of the 8 weeks of instruction assigned)
will be devoted to coverage of prerequisite material in both
areas A and B. Now, even though different amounts of time will be
devoted to instruction in these three areas, the staff might
judge them to be of equal importance in terms of the trainee's
post-instruction competence. Thus, each subject area would be
assigned a percentage weight of 33 1/3% and the areas would be
allocated equal numbers of test items (rather than the 40%, 40%,
20% item allocation set by the time-based approach).

In the above example, it was the professional judgment of the
staff that the three subject areas are of equal importance in
terms of the material that the trainees should be competent with.
In this case, the time-based approach would not have been
adequate for ensuring a balanced test coverage of the relevant
content of instruction.

After the initial weighting, the major subject areas will be
broken down into a number of discrete topics in order to provide
a broader subject base from which to construct test items. A
hypothetical example will outline the procedure.

A sixteen week training program, "Quantitative Methods 
in Health-Related Research" is composed of three major 
subject areas: 

1. Descriptive and inferential statistics (8 weeks)

2. Techniques of demographic analysis (4 weeks) 

3. Survey design (4 weeks) 

Although the amount of teaching time to be devoted to area A is
twice that of either B or C, the staff judges areas A and B to be
twice as important as area C; therefore, each will receive an
assigned weight of 40% while C receives 20%. Thus, 80% of all
items to be constructed will be divided equally between areas A
and B, with 20% of the items allocated to assessing competence in
area C.
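
To see how far the staff's judgment departs from a purely time-based allocation in this hypothetical program, the two sets of weights can be placed side by side. This is only a sketch; the variable and function names are assumptions, and the figures are the ones given in the example above.

```python
def percent_by_time(weeks_by_area):
    """Time-based percentage weights, proportional to weeks of instruction."""
    total_weeks = sum(weeks_by_area.values())
    return {area: 100 * weeks / total_weeks for area, weeks in weeks_by_area.items()}


# Sixteen week program: area A runs 8 weeks, areas B and C run 4 weeks each.
weeks = {"A": 8, "B": 4, "C": 4}
# Staff judgment from the text: A and B are each twice as important as C.
judged = {"A": 40, "B": 40, "C": 20}

time_based = percent_by_time(weeks)
shift = {area: judged[area] - time_based[area] for area in weeks}

for area in weeks:
    print(
        f"area {area}: time-based {time_based[area]:.0f}%, "
        f"judged {judged[area]}%, shift {shift[area]:+.0f} points"
    )
```

Under these figures the judged weights move ten points away from area A and fifteen points toward area B compared with what teaching time alone would dictate.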

The staff members with particular expertise in these major areas
will decide which of the topics within these areas should be
included in the subject universe to be assessed (selection to be
based upon the "relevance to competence" criterion stated on
p. 16). The material selected will then be recorded, together
with the list of behavioral learning outcomes, in the Item
Specification Table.

A partial listing of the relevant subject matter comprising the
program's statistics training component is illustrated in the
Item Specification Table in Figure 1. (The content listings for
the other two areas would follow the same format and could be
recorded below the statistics section on the same table. However,
given the length of combined listings from several subject areas,
it might be best to break the Table up into sub-tables, one for
each subject area.)

Assigning weights within subject areas. Once the most important
topics have been delineated for each major subject area (and
recorded in the table), another series of judgments must be made
by the staff. That is, the decision must be made as to how the
total percentage of items allocated to a major area is to be
distributed among the topics comprising that area. The allocation
of percentage weights to these sub-areas will have to be based
primarily upon subjective staff judgments as to the relative
importance of competence in each sub-area to overall competence
in the major subject area. (The weights selected will be recorded
in the Table's right-most column, labelled "total % items.") For
example, the staff decided that competence with "Probability
Sampling and Sampling Distributions" will contribute more to
overall statistical competence than will competence with such
material as "Measures of Central Tendency" or "Frequency
Distributions." The relative percentage weights assigned to each
of these topics (i.e., 15%, 5%, 5%, respectively) are the result
of these decisions.
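
One way to make such within-area judgments explicit is to have the staff rate each topic and then split the area's percentage in proportion to the ratings. The sketch below is an assumption layered on the text, not the Manual's procedure: the 3:1:1 ratings and the 25% share covering these three topics are hypothetical values chosen so that the result matches the 15%, 5%, 5% weights quoted above.

```python
from fractions import Fraction


def split_area_weight(area_percent, importance):
    """Divide an area's percentage among its topics in proportion to staff ratings."""
    total_rating = sum(importance.values())
    return {
        topic: Fraction(area_percent) * rating / total_rating
        for topic, rating in importance.items()
    }


# Hypothetical staff ratings of relative importance (higher = more important).
ratings = {
    "Probability Sampling and Sampling Distributions": 3,
    "Measures of Central Tendency": 1,
    "Frequency Distributions": 1,
}
# Hypothetical share of all test items covered by these three topics.
topic_weights = split_area_weight(25, ratings)

for topic, percent in topic_weights.items():
    print(f"{topic}: {float(percent):.0f}%")
```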

Selecting appropriate behavioral learning outcomes. The next set
of staff decisions involves selecting the cognitive behavioral
outcomes most appropriate to the subject content to be assessed.
As pointed out earlier (e.g., see p. 1), the one objective common
to most training programs is to develop, in the trainee, the
cognitive competence to use and to apply the subject material
learned, not simply the ability to remember the material on
demand. The test instrument designed to assess competence at this
level should therefore focus on the more complex learning
outcomes, with less emphasis on simple acquisition of knowledge.



The test design should require the trainee to demonstrate
behaviors in the upper achievement areas (i.e., areas 3-7, listed
on pp. 8-10), with less stress on assessing those cognitive
behaviors within areas 1 and 2. Of course, not all higher level
learning outcomes will be appropriate for each and every subject
topic (e.g., the ability to calculate would be an inappropriate
expected behavioral outcome in a seminar on human reproductive
anatomy and physiology). However, the general procedure should be
to consider the applicability of the higher order achievement
areas before considering those lower on the list. The relative
emphasis provided each of these behavioral areas when covering a
certain subject during instruction will help guide the allocation
of items to specific behavioral outcomes. The percentage weights
assigned to a particular behavior for a specific subject topic
will be recorded in the appropriate cells of the Item
Specification Table. (The numbers, in parentheses within each
cell, represent these percentage weights.)

It should be noted here that the table indicates the relative
importance of each cell in terms of an assigned percentage
weight, not in terms of the actual number of items to be assigned
to each cell. Relative numbers of assigned items will depend upon
the decision of how many items will comprise the total test.
(This decision point will be discussed in the next section.)
Before taking up the question of determining optimum test size,
one further step must be considered.
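
Once a total test length has been chosen, percentage weights have to be turned into whole numbers of items. The Manual does not prescribe a rounding rule; the sketch below assumes a largest-remainder rule (round every cell down, then hand the leftover items to the cells with the largest fractional parts) and assumes the weights sum to 100. The function name is hypothetical.

```python
import math
from fractions import Fraction


def items_per_cell(percent_weights, total_items):
    """Convert percentage weights (assumed to sum to 100) into whole item counts."""
    exact = {
        cell: Fraction(weight) * total_items / 100
        for cell, weight in percent_weights.items()
    }
    # Round every cell down first.
    counts = {cell: math.floor(value) for cell, value in exact.items()}
    # Give the leftover items to the cells with the largest fractional parts.
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda cell: exact[cell] - counts[cell], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


print(items_per_cell({"A": 40, "B": 40, "C": 20}, 50))
thirds = {"A": Fraction(100, 3), "B": Fraction(100, 3), "C": Fraction(100, 3)}
print(items_per_cell(thirds, 50))
```

With equal weights of 33 1/3% and a 50 item test, the counts cannot be exactly equal, so one area necessarily ends up with one item fewer than the others.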

In order to obtain a more reliable estimate of the numbers of
test items required, it might be necessary to subdivide the
topics (within major subject areas) into subtopics. This
procedure will help ensure a more balanced test coverage of the
total subject area. Furthermore, subdividing will result in a
number of discrete, homogeneous subject subtopics for which
specific test items can be written. For example, the statistics
topic "Measurement and Scales," listed in the table, can be
further subdivided into:

1. Variables and constants

2. Discrete variables

3. Continuous variables

4. Nominal measurement

5. Ordinal measurement

6. Interval measurement

7. Ratio scales

With this breakdown of topics within subject area and subtopics
within topics, the test designer will have a comprehensive
blueprint from which the test instrument can be constructed.

Determining Total Number of Test Items

The number of items to be included in the completed test
instrument is the last decision that must be made prior to
beginning the task of item construction.

In many testing situations, the number of items administered is
determined primarily by the amount of time available for testing.
Since it is strongly recommended that the administration of the
test instrument be untimed (see p. 42), the factor of time will
not put constraints on test size. Since there are no hard and
fast rules about numbers of items to be included in a test, the
ultimate decision will be a subjective one and should involve the
entire training staff. It should be kept in mind, however, that
the larger the number of test items administered, the more
adequate will be the test sample (in terms of course content
coverage) and the more reliable will be the scores derived from
testing.

A general guide can be applied when determining the number of
items to be allocated to the subject matter within major areas.
When each of the topics comprising a subject area has been
further partitioned into a number of discrete subtopics or
sub-areas -- e.g., the subdivision of "Measurement and Scales"
into 7 constituent subtopics (see p. 21) -- an attempt should be
made to construct at least one item for as many of the sub-areas
as is feasible. For example, when the number of topics within an
area is relatively small (i.e., around 5), it might be feasible
to allocate an item (or items) to each of the major subtopics.

However, when the number of topics in an area is large (e.g.,
those comprising the statistical training component, for which
only a partial listing of subtopics was provided in Figure 1),
the size of the Item Set that would result if one item were
constructed for each subtopic would probably be too large to be
administered effectively (especially when combined with the Item
Sets sampling the remaining subject areas). In such a situation,
the recommended procedure is to sample within subject topics.
That is, instead of one item (or more) per subtopic, a balanced
coverage of a topic can be obtained by constructing an item (or
items) for every other subtopic listed -- e.g., for the
"Measurement and Scales" topic, items would be constructed for
subtopics 1, 3, 5 and 7. Provided that the selection of subtopics
for item writing is random -- i.e., subtopics on the list are
selected by ordinal position (every nth one) and not simply
because good items can be written for them -- then the resulting
item groups will provide a balanced and representative sample of
both the subject content and cognitive behaviors to be covered
during the course of instruction. Administration of test items,
selected according to the above procedures, will allow a valid
assessment of both trainee achievement and training effectiveness
in raising the level of subject matter competence.
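
Selection by ordinal position, as described above, amounts to taking every nth entry of the subtopic list. A minimal sketch using the seven "Measurement and Scales" subtopics from the text (the function name and the `start` parameter are assumptions):

```python
def every_nth(subtopics, n, start=0):
    """Pick subtopics by ordinal position: every nth entry, beginning at index `start`."""
    return subtopics[start::n]


subtopics = [
    "Variables and constants",
    "Discrete variables",
    "Continuous variables",
    "Nominal measurement",
    "Ordinal measurement",
    "Interval measurement",
    "Ratio scales",
]

# Every other subtopic, starting with the first: positions 1, 3, 5 and 7.
selected = every_nth(subtopics, 2)
print(selected)
```

Because positions are fixed before any items are drafted, the choice cannot drift toward subtopics that merely happen to be easy to write items for.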

One factor that must be considered when determining the total
number of items to include in the test is the amount of prior
experience the trainee group has had in taking objective-type
tests. Even when the number of test items is relatively small, a
lack of experience in dealing with objective items can lower the
validity of the test for its intended purpose. (The students
participating in the Rennes Francophone Africa FP/MCH Training
Program were not familiar with the mechanics of objective-type
test taking and required an average time of four hours to
complete the 116 item test.) Introducing inexperienced trainees
to the mechanics of taking the objective-type test (prior to the
time of administration) will help cut down the time required for
test-taking as well as allow a greater number of items to be
included in the instrument.






Note: Given the above discussion, it is nonetheless possible to
consider a recommendation for a reasonable ceiling on the number
of items to be included in the test instrument. In order to do
this, another factor, the time allocated to testing, must be
taken into account.

In most testing situations, the number of items administered
is determined, for the most part, by the time limits imposed
on the testing. For reasons to be discussed (see p. 42),
it is recommended that the test instrument being proposed here
be administered without rigid time limits. The only time
constraint suggested is that each administration of the test be
completed at one sitting so as to minimize the use of training
time for testing purposes, and to keep the pre- and post-testing
situations as uniform as possible. Breaking up the
test into two or more sub-units to be administered on successive
days is possible, but not recommended, since this would
involve the trainee in an excessive number of time-consuming
test-taking situations (i.e., possibly four or more separate
sessions for combined pre- and post-testing). Placing too
much emphasis on testing and assessment might have a negative
effect on trainee morale and therefore decrease the effectiveness
of the training experience. (This is especially true if
the trainees are middle to high level professionals, e.g.,
FP/MCH physicians and program administrators, who might
resent being subjected to too much testing and personal
assessment.)

It is best to administer the test during an afternoon session
so that the test period can extend beyond the session to
allow most of the examinees to complete the test. (Although
the test will be untimed, it should be untimed only to the
extent that ^0-90% of the examinees finish the total
instrument; a small percentage of examinees will always take as
much time to complete the test as they are given and
therefore have to be limited in the amount of time they can take.)

Based upon our experience in constructing test instruments
for the assessment of training in the area of family planning
and maternal/child health, we feel that we can recommend that
(for most testing situations) an instrument made up of 150
items (maximum) should be adequate. If several item formats
(e.g., simple and complex multiple-choice, interpretive
exercises) are included with more or less equal frequency,
a test of this length would require approximately three hours
to complete. (This is based upon an average trainee's read
and response time of 30-45 seconds for a simple multiple-choice
item; 60-90 seconds for a complex multiple-choice
item; and 60-120 seconds for an item requiring computations
and problem solving, or based upon an interpretive exercise.)
Even allowing for a break, three hours is likely to be the
limit of most trainees' endurance for test taking, after
which the effects of fatigue could seriously impede test
performance. If, out of all the subject matter to be
covered during training, only the most important material
is selected for assessment (see pp. 15-16), then a test
composed of approximately 150 items should be adequate in
providing a representative sample of the learning outcomes
and subject matter for even a lengthy sequence of instruction
covering complex material.
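
The three-hour figure can be checked with a short calculation. The sketch below uses the response-time ranges quoted above and assumes, purely for illustration, an even split of 50 items per format; the function and dictionary names are hypothetical.

```python
# Read-and-response times in seconds per item, as (low, high) ranges
# taken from the text.
RESPONSE_SECONDS = {
    "simple_multiple_choice": (30, 45),
    "complex_multiple_choice": (60, 90),
    "computation_or_interpretive": (60, 120),
}


def estimate_minutes(item_counts):
    """Return (low, midpoint, high) testing time in minutes."""
    low = sum(n * RESPONSE_SECONDS[fmt][0] for fmt, n in item_counts.items())
    high = sum(n * RESPONSE_SECONDS[fmt][1] for fmt, n in item_counts.items())
    return low / 60, (low + high) / 2 / 60, high / 60


# Assumed split: 150 items, the three formats in equal numbers.
counts = {
    "simple_multiple_choice": 50,
    "complex_multiple_choice": 50,
    "computation_or_interpretive": 50,
}
low, mid, high = estimate_minutes(counts)
print(f"{low:.0f} to {high:.0f} minutes, midpoint {mid:.0f}")
# 125 to 212 minutes, midpoint 169
```

Under these assumptions the midpoint is a little under three hours, which agrees with the estimate given in the text; a test weighted toward computational items would run longer.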

Naturally, the more test items that have been subjected to
trial testing and item analysis, the more confident the test
designer can feel that his test items are performing
effectively for their intended purpose. The creation of an
item file based on the analysis data (see pp. 121-127) will
aid in constructing a test made up of the least number of
items offering the most valid and reliable sample.






Use of Objective Items 

The two types of items most commonly employed in tests
designed to measure achievement are the open-ended essay item
and the objective* or fixed-response item. Which type is
most appropriate for use under a Test-Retest design to assess
the learning outcomes being evaluated? The final selection
of objective items was based on two essential factors:

1. The serious limitation of the unreliability of the
scoring of essay-type items (4). Because scoring is usually
based on subjective criteria, subject to the impressionistic
biases of the examiner, it has been proven time
and again that the same essay can receive different grades
from different examiners and even from the same examiner
at different points in time. To the degree that a test
score reflects the private, subjective, unverifiable
impressions and values of one particular scorer, it is
deficient in meaning and hence in usefulness to the student
who received it or to anyone else (e.g., the training
staff) who is interested in the ability or achievement
being assessed (5). If a testing procedure is designed
to measure change in subject competence over time, then
an unequivocal, objective scoring procedure is required
so that changes that do occur can be interpreted as an
increase in competence and not as due to variations in the
criteria for scoring between the first and second
administrations of the test.

2. Employing objective items affords the opportunity for
a larger and more representative sampling of relevant
course subject content than is possible with essay items.
Only a limited number of essay items can be given during
any one testing session; thus, it is rarely possible to
cover all subject areas adequately, and overemphasis on
some areas of learning and total neglect of others may
result.

Essay items do, of course, have some decided advantages over
objective items, especially in assessing learning outcomes
where originality or writing ability are important factors.
In terms of the objective cited in the Manual, however, essay
items pose several problems that overshadow factors favoring
their use. The objective item test, presenting far more
items than essay tests and reducing all responses to a form
that can be easily and unambiguously scored, avoids these
confounding problems (6).



* The term "objective" item actually refers to the fact that
the correct responses are determined at the time of item
construction; this helps to ensure uniformity in assessing
the correctness or appropriateness of the responses given.







It should be pointed out here that the methodology also calls
for the administration of the identical test both before and
after the period of instruction. It is possible to employ
the Pre-/Post-Test design without using the same test both
times by constructing a parallel form** of the test instrument
with comparable items. Due to the excessive amount of time
required and the enormous technical difficulties encountered
when attempting to construct two sets of items, the
methodology here recommended calls for the administration of the
same set of items on both occasions.

Forms of Objective Items

There are several forms of objective items, each with
particular strengths and weaknesses and each requiring special
skill in construction. For most applications of this
instrument, limiting the number of different forms of the items
used to one or two is strongly recommended. This limits the
number of item-writing skills that the training staff will
have to master and also eliminates difficulties which may be
encountered in explaining to trainees the procedures for
dealing with a number of different types of tasks. The two forms of
items recommended, for reasons described below, are multiple
choice items and interpretive exercises. Both forms of items
can be employed quite effectively in the assessment of the
impact of instruction in the areas of knowledge,
understanding and application.



Multiple Choice  A multiple choice item is one in which a
question is posed and the trainee asked to choose the correct
or best answer from a number of listed alternatives. There
are many variations on this basic format (such as choosing
the word(s) that best complete a statement) but in all cases
the trainee is being asked to select the correct or best
answer from among other incorrect or less appropriate ones.




It is, of course, possible for a person to guess the correct
answer of a multiple choice item, but with only one out of
four or five alternatives correct, the probability of his
guessing all items is very small. Attention to the details
of constructing these items will further reduce the
probability of a person's guessing correctly every time. (Appendix
C contains a discussion of multiple choice item construction
with examples.)
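
To give a sense of how small that probability is, the following sketch computes it under the simplifying assumptions that the examinee guesses blindly, that every alternative is equally likely to be chosen, and that items are independent. The function name is illustrative only.

```python
from fractions import Fraction


def prob_all_correct_by_guessing(n_items, n_alternatives):
    """Probability of answering every item correctly by blind guessing."""
    return Fraction(1, n_alternatives) ** n_items


# Ten four-alternative items: less than one chance in a million.
p = prob_all_correct_by_guessing(10, 4)
print(p)  # 1/1048576

# A 150-item test with five alternatives per item.
print(float(prob_all_correct_by_guessing(150, 5)))
```

In practice, well-informed examinees eliminate some alternatives before guessing, so these figures describe only the pure-chance case.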



** Essentially two test instruments differing only with
respect to the sample of items selected, the two item
sets having been equated (through previous trial testing)
in terms of content validity, difficulty, etc.






Interpretive Exercises  Interpretive exercises are especially
useful for measuring achievement of the type discussed on
pp. 8-12. That is, these exercises can best be employed to
assess understanding and the ability to adapt and apply what
was learned rather than for the assessment of simple recall
and/or recognition. Here the examinee is given material
(such as a table, chart, illustration or a paragraph of text)
and is asked a series of questions about it. The examinee's
task is to interpret the material in one of several ways. In
some cases, he may be asked to identify which statements are
true; in others he may be asked to indicate an opinion about
a statement's accuracy; or he may be given a series of
multiple choice items regarding the material presented. In
general, such exercises give the student an opportunity to show
whether he has learned to apply new or old skills to the
interpretation of unfamiliar data. (Rules for constructing
interpretive exercises and examples will be found in Appendix

Other Forms of Objective Items  True/false, matching, and
short answer or fill-ins are other forms of objective items
often used. The major argument against fill-ins and matching
items is the difficulty in arranging an answer sheet to
accommodate them, especially if the scoring is to be done by
computer. Short answer items must always be scored by hand,
since the student must write in a word or phrase. Both forms
are difficult to construct. Matching items may take a long
time to answer and are, in any case, really only another more
complicated form of multiple choice item. Short answers or
fill-ins may often lead to ambiguous situations where a number
of responses could be considered correct. While it is less
possible to guess short answer items correctly (since the
student must supply missing information), it is nonetheless
recommended that short answer items be transformed into
multiple choice items.

The chief argument against true/false items, where a student
is simply asked to state whether an item is true (correct) or
false (incorrect), is that there is a 50% chance of guessing
correctly on every item. However, there are other reasons why
they should be avoided. It is difficult to make a statement
that may be categorically classified as true or false, and any
distinction such as "more true" or "more false" clouds the
issue and makes the item less valid as a measure of a student's
competence. Items must be constructed very carefully to
prevent any kind of ambiguity. Again, it is easiest to transform
true/false items into multiple choice items, reducing the
possibility of correct guessing and, at the same time, making all
items consistent in form.
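
The effect of item format on chance scores can be made concrete. The sketch below assumes blind guessing on every item of a hypothetical 100-item test and compares the score an examinee could expect from chance alone.

```python
def expected_chance_score(n_items, n_alternatives):
    """Expected number of items answered correctly by blind guessing."""
    return n_items / n_alternatives


# Hypothetical 100-item test in three formats.
for label, k in [("true/false", 2), ("4-choice", 4), ("5-choice", 5)]:
    print(label, expected_chance_score(100, k))
# true/false 50.0
# 4-choice 25.0
# 5-choice 20.0
```

The lower the chance score, the more of the observed score range is left to reflect actual competence, which is one way of reading the recommendation to convert true/false items to multiple choice.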






General Guidelines for Construction of Objective Items 

Many of the following suggestions may seem obvious, but
professionals experienced in reviewing and editing test items
indicate that the most obvious faults are the most frequently
committed in the preparation of objective tests.

It should be stressed that Pre-/Post-Course testing, if it is
to be an accurate means of evaluating both individual
achievement and the effectiveness of a sequence of training, must
provide the maximum amount of objective data possible. The
composition of unambiguous and untricky test items is for this
reason absolutely essential. Unless skillfully written,
objective items may suffer some of the disadvantages of the
essay item in that different answers may be of varying degrees
of correctness. Subjectivity will be introduced into scoring
even though the items themselves are "objective."

It should also be stressed that the test instrument being
constructed is "self-defining" (7), in the sense that the test
itself defines what constitutes desired competence. The test,
in turn, is constructed by the training staff based upon what
they consider constitutes a high level of competence in the
subject areas under assessment. How carefully the test is
constructed is consequently of utmost importance. Its validity
as an assessment of subject matter competence rests on the
skillful construction of test items, and on the perception of
the training staff in determining relevant subject areas and
behaviors. It is for this reason that these aspects of the
instrument have been stressed in the Manual.

The following suggestions and guidelines are provided to help
avoid the difficulties inherent in the construction of objective
test items and to help ensure that the items assess only what
they were designed to assess.

General Guidelines (8) 



1. Keep the level of reading difficulty low. Complexly written
items or the use of unnecessarily technical vocabulary can
put an unfair burden on the test taker and interfere with
the ability to demonstrate competence in the subject matter.

2. Do not take items verbatim from books or lecture notes.
Correct responses to such items may be the result of
recognition or rote memorization and may not reflect
understanding. In addition, lifting a sentence from its
context may change its meaning. It is best to paraphrase
material used in items.






3. The intended correct answer must in fact be correct.
Whenever a correct answer relies not on undisputed fact
but on knowledge of opinion or point of view, the source
of that opinion must be identified (e.g., "According to
Malthus...").

4. Be certain that all items deal with significant subject
content. Do not use items that rely on knowledge or
understanding of trivialities. Before including any item,
ask yourself whether it is relevant to the desired
competence of the trainees in the subject area being assessed.

5. Each item should be independent of every other item. A
correct answer on one item should never be a prerequisite
for answering subsequent items correctly. In addition,
items should not provide clues for the solution of other
items.

6. Avoid recognizable patterns in positioning of correct
responses. Set the position of the correct response in
a random manner, to avoid the test takers' trying to
outguess you by figuring out the pattern of responses.

7. Only one alternative should be the correct or most
correct response. Allowing more than one answer to be correct
is confusing to test takers, and in addition turns each
item into a string of "true/false" statements. It is best
to allow only one answer to be correct. But make sure
one is correct.

8. Make all alternatives equally plausible to the examinee
who lacks the understanding or ability required to answer
the item. If one or more of the alternatives is obviously
ridiculous, even to someone who doesn't have any idea of
what the correct answer might be, the chances of his guessing
correctly are greatly improved.

9. When constructing the items according to the requirements
in the Item Specification Table, it is best to construct
more than the number planned for the final test instrument.
Extra items will replace items which were initially
acceptable, but proved to be unsatisfactory (and could not be
adequately revised) when subsequently reviewed. A 20-25%
reserve per Item Set should be sufficient.

10. Each item constructed should be recorded on a separate
5 x 8 card. This procedure will facilitate the review
and editing of items since it is easier to revise defective
items and to delete from and add to the item pool when each
item is on a separate card. Also, simply by sorting the
item cards, the serial placement of items for the final
instrument can be arranged and, if necessary, rearranged
until an acceptable order has been achieved.

In addition to the item and correct answer, the content
area and cognitive behavior assessed by the item should
be recorded on the card. This information can be checked
against the cells of the Item Specification Table to
determine if the final test item pool does, in fact,
provide the balanced coverage outlined in the Table.

Finally, if an item analysis is conducted (see pp. 121-127),
the results for each item should be recorded on the card.
These data will help in the compilation of an effective
item card file that can be used in a future evaluation
study.

11. Test items should be constructed from two to four weeks
prior to the time of the first Pre-Test, put aside for a
period of time and then critically reviewed for defects.
This procedure will, among other things, help to uncover
ambiguities and inconsistencies (in subject content,
grammar, vocabulary, spelling, etc.) which were initially
overlooked. Whenever possible, independent staff members
familiar with the subject material should be called in to
review and criticize the items, help revise defective
items and select replacements from the reserve sets.

Specific attention should be given to such factors as the
appropriateness of the test content as well as of the
cognitive behaviors called for within content areas and
the accuracy of the scoring key.

In addition to assessing whether each item adheres to the
guidelines provided above (i.e., nos. 1-8), the item
review should involve some additional checks.

a) A check for balanced, representative test coverage.
The question to be answered here is, "Do the
items, as constructed, still relate to the
content/behavior cells in the Item Specification
Table?" If a discrepancy exists, the item can either
be revised to conform to its original purpose or
reclassified on the Table according to its new content
and/or behavioral objective. When large numbers of
items are reclassified, a check of the Table should
be made to make sure that all content areas are given
representative item coverage (in terms of the
percentage weights specified in the Table). When
necessary, reserve or newly constructed items should
be added to under-represented areas.

b) If independent staff members (familiar with the subject
material under study) take part in the reviews, they
should read each item and answer it, in addition to
checking for item defects. Any discrepancy between
their answers and the keyed (correct) answer would be
evaluated by the staff, with appropriate action taken
to reduce any ambiguity surrounding the correct
answer to that item.

c) After the final item pool has been compiled and
ordered serially within Item Sets, a final procedure
should be carried out to determine the degree to
which the content and behavioral specifications of
the items actually constructed agree with the item
requirements as originally defined in the
Specification Table. This is done by recording the number of
each item in the appropriate behavioral outcome/content
cell on the Table (see Figure 1). When all
item numbers have been recorded, simply check
whether the number of items in each cell corresponds
to the percentage of total items (the number between
the parentheses) originally allocated to that cell.
If the relative numbers within the cells agree, then
there is evidence that the test instrument will be
assessing a representative sample of the behaviors
and content areas covered during the course of
instruction (provided, of course, that the Table was
properly constructed according to the guidelines
presented on pp. 12-16). If a large number of cells
show major discrepancies between numbers of items
proposed and items constructed, then items should be
deleted or added until there is concordance between
these values in the majority of cells.
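
The check described in (c) amounts to tallying items per cell and comparing each cell's share with its allocated percentage. A minimal sketch follows; the cell labels, the percentages, and the tolerance of five percentage points are all assumptions chosen for illustration, not values from the Manual.

```python
from collections import Counter


def coverage_report(planned_pct, item_cells, tolerance_pct=5.0):
    """Compare planned and actual percentage of items in each cell.

    planned_pct: {cell: percentage originally allocated to that cell}
    item_cells:  {item number: cell the finished item belongs to}
    Returns {cell: (planned %, actual %, within tolerance?)}.
    """
    if not item_cells:
        raise ValueError("no items recorded")
    total = len(item_cells)
    actual = Counter(item_cells.values())
    report = {}
    for cell, pct in planned_pct.items():
        actual_pct = 100.0 * actual.get(cell, 0) / total
        report[cell] = (pct, actual_pct, abs(actual_pct - pct) <= tolerance_pct)
    return report


# Hypothetical three-cell table and an eight-item pool.
planned = {
    ("Demography", "knowledge"): 50,
    ("Demography", "application"): 25,
    ("Statistics", "knowledge"): 25,
}
items = {n: ("Demography", "knowledge") for n in range(1, 6)}
items[6] = ("Demography", "application")
items[7] = ("Statistics", "knowledge")
items[8] = ("Statistics", "knowledge")

for cell, (want, got, ok) in coverage_report(planned, items).items():
    print(cell, want, got, "ok" if ok else "revise")
```

In this made-up pool the first cell is over-represented (62.5% against 50%) and the second under-represented (12.5% against 25%), so items would be deleted from one and added to the other.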

NOTE: For more specific guidelines on construction of the
recommended forms of objective items, with examples and
discussion of poorly constructed items, see Appendix C.
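
The guideline on avoiding recognizable patterns in the position of correct responses can be followed mechanically by shuffling the alternatives of each item and recording where the correct one lands. The sketch below is one possible way to do this; the function name, the seed, and the placeholder alternatives are hypothetical.

```python
import random

LETTERS = "ABCDE"


def shuffle_alternatives(correct, distractors, rng):
    """Shuffle one item's alternatives; return (ordered options, key letter)."""
    options = [correct] + list(distractors)
    if len(options) > len(LETTERS):
        raise ValueError("at most five alternatives are supported")
    rng.shuffle(options)
    key = LETTERS[options.index(correct)]
    return options, key


# A fixed seed makes the ordering reproducible when the booklet is retyped.
rng = random.Random(1975)
options, key = shuffle_alternatives(
    "correct answer",
    ["distractor 1", "distractor 2", "distractor 3"],
    rng,
)
for letter, text in zip(LETTERS, options):
    print(letter, text)
print("keyed response:", key)
```

The key letter returned for each item would be entered on the item card and on the scoring key at the same time, which also guards against key errors.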






Non-Staff Lecturers and the Problem of Adequate Item Coverage

The guidelines for conducting the assessment study were originally
designed for use with training programs having a "resident" training staff.
This procedure was based on the assumption that the personnel who designed
the curriculum plan will also conduct the instruction sessions. As field
applications of the methodology continued, however, it became apparent
that the average health-related training program involves input from a
core (i.e., resident) staff as well as from a number of outside lecturers
who are experts in their respective fields. This is especially true for
programs in Family Planning and Maternal/Child Health, which require
training in a number of diverse subject areas.*

Training administrators (including staff instructors) are responsible
for the structuring of the training program (as discussed in Chapter II,
pp. 18-21). They develop the comprehensive session-by-session curriculum
plan for the sequence of instruction as well as construct the Item
Specification Table for the assessment. When the core staff conducts all
of the sessions, it has the added responsibility of constructing all the
items comprising the competence assessment test instrument. This
situation is somewhat altered when outside experts are called in as
lecturers.

While the subject material to be covered by outside lecturers is broadly
defined by the training administrators (in accordance with the overall
subject theme of the program), the specifics of what is covered are
determined by the lecturer (a subject expert). Thus, the lecturer has
primary responsibility for the structure and content of his presentations.
Since the subject areas covered by outside experts are part of the training
curricula, it will be necessary to assess competence in these areas. The
visiting lecturer will be the person most qualified to set the standards
for competence in his particular area of expertise. Therefore, the
non-resident lecturer should become an integral part of not only the training,
but also the assessment of its impact.



* For example, a four-month Francophone African Training Program conducted
(in the Spring of 1974) at the National School of Public Health in
Rennes, France, involved the sponsoring staff from the Department of
Health and Family Protection as well as 30 outside instructor/lecturers
from such areas as Demography, Statistics, Human Reproduction & Family
Planning, Maternal/Child Health, Clinic Procedures, Health Administration,
Health Education, etc.






In order to integrate the outside lecturer into the training program's
evaluation framework, with a clear definition of what is required, a
number of steps should be taken:

1. When outside experts are initially asked to conduct certain
sessions, they should be made aware that submitting test items
for the assessment of training would also be necessary. (The
immediate and long range importance of training evaluation as
well as the need to incorporate evaluation into the training
structure during the planning stage should be impressed upon
them at that time.)

2. The lecturers should be provided with the general topic areas
they will be responsible for covering. They should then be
requested to submit to the resident staff a detailed outline
of the subject material they plan to present at each session.
This outline should also include judgments concerning the
relative importance (based on the relevance to competence
criterion, p. 16) of the topics and subtopics comprising the
subject area to be covered. Relative importance is indicated
by assigning percentage weights which will ultimately determine
the number of items to be allocated to each content component
of the subject area (see pp. 16-21 for a discussion of assigning
relative weights).

3. Each submitted outline will be incorporated into a composite
Item Specification Table. For each subject area entry both
the types of ability to be tested and the number of items to
be constructed will be determined and recorded (see Figure 1,
p. 14).

4. a. The Item Specification Table will be forwarded to each non-staff
lecturer with the types and numbers of items he is to design for
each subject entry clearly delineated.

b. An achievement item "information paper" should also be provided
to each non-resident lecturer. It would be a composite paper
composed of those sections of the Manual which are relevant to
the construction of test items. The paper should include:

- The 7 Achievement Areas and sample achievement items
(pp. 8-12).

- Use and forms of objective items
(pp. 25-27).

- Guidelines and rules for constructing objective-form
items with examples (pp. 28-31 and Appendix C).

This material will provide outside lecturers with all the
information necessary to construct objective items that will
adequately assess substantive knowledge and abilities (i.e.,
competence) in specific subject areas.






Requiring lecturers to construct such items without proper
guidelines will probably result in groups of items which are overly
represented by the more common and easily-designed categories
(understanding terminology, facts and principles), with
under-representation of those measuring more complex abilities (i.e.,
ability to predict, recommend appropriate action and make
evaluative judgments).*

5. Each of the items submitted should be checked for adequacy in
terms of the subject content and ability assessed. Appropriate
changes in item structure and content should then be made. The
types of items submitted should also be checked off on the Item
Specification Table so that an assessment of item representation
can be done.

If the final selection of items by test trials with follow-up item
analysis is not feasible (see pp. 121-127 for a discussion of item
tryouts and analysis), the items should be put away for a while
and then rechecked for structure and content by the resident
staff before putting together the completed instrument.

Requests for items from visiting lecturers should be made (when
timing permits) several months before the start of instruction.
This will allow each lecturer several weeks to comply as well as
provide adequate time to check submitted items, set them aside for
a short period and then recheck, revise and select the final items
for the test instrument.
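
Turning a lecturer's percentage weights into whole numbers of items, as called for in steps 2 and 3, requires a rounding rule, which the text does not specify. The sketch below assumes a largest-remainder rule so that the counts always sum to the intended total, and adds the 25% drafting reserve suggested in the general guidelines; topic names and weights are hypothetical.

```python
import math


def allocate_items(weights_pct, total_items):
    """Allocate whole items to topics in proportion to percentage weights.

    Assumed rule: round down, then give leftover items to the topics
    with the largest fractional remainders.
    """
    if abs(sum(weights_pct.values()) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    exact = {t: total_items * w / 100 for t, w in weights_pct.items()}
    counts = {t: math.floor(x) for t, x in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


def with_reserve(counts, reserve=0.25):
    """Number of items to draft per topic, including a reserve."""
    return {t: math.ceil(n * (1 + reserve)) for t, n in counts.items()}


# Hypothetical outline: three topics sharing a 12-item set.
weights = {"Topic A": 50, "Topic B": 30, "Topic C": 20}
final_counts = allocate_items(weights, 12)
print(final_counts)                # {'Topic A': 6, 'Topic B': 4, 'Topic C': 2}
print(with_reserve(final_counts))  # {'Topic A': 8, 'Topic B': 5, 'Topic C': 3}
```

The second line of output is what a lecturer would be asked to draft; the first is what the finished Item Set would contain after review.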



* This happened in one field situation and necessitated major revisions
in many of the items submitted by visiting lecturers in order to make
the overall item content more representative of the wide range of
abilities subsumed under the assessment of subject competence. This
extra work on the part of the resident staff might have been avoided
had each visiting lecturer received a copy of the written guidelines
and item samples.






THE TEST FORMAT 



Once the Item Specification Table has been constructed and the
individual items designed, the next step is to determine the
most effective format for presenting these items to the
examinees in a structured testing situation.

General Considerations

The procedures for test construction and administration to be
described were developed to meet two basic criteria:

1) For trainees: neither the test directions or format, nor
the testing environment should produce variation in test
performance that is not correlated with differences in
levels of competence among trainees.

2) For training administrators: the test format should
facilitate error-free scoring and should provide economy
of cost, time and effort.
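
The second criterion is easiest to meet when every response sheet is scored against a single fixed key, the same key for the Pre-Test and the Post-Test. The following sketch shows the idea; the five-item key and the response strings are invented for illustration, and blank responses are simply scored as incorrect.

```python
def score_answer_sheet(key, responses):
    """Count the responses that match the key; blanks (None) score zero."""
    if len(key) != len(responses):
        raise ValueError("answer sheet length does not match the key")
    return sum(1 for k, r in zip(key, responses) if r == k)


# Hypothetical five-item key and one trainee's two answer sheets.
key = ["B", "D", "A", "C", "E"]
pre_test = ["B", "A", None, "C", "A"]
post_test = ["B", "D", "A", "C", "A"]

pre_score = score_answer_sheet(key, pre_test)
post_score = score_answer_sheet(key, post_test)
print(pre_score, post_score, post_score - pre_score)  # 2 4 2
```

Because the key is fixed before either administration, any difference between the two scores cannot be attributed to a change in scoring criteria.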

The testing procedure recommended is the one most commonly used
in administering group achievement tests. Each examinee is
given his own set of test items and instructions (in booklet
form) together with a separate sheet for recording item responses.
The examiner could administer the test orally, allowing a uniform
response interval following each item. However, while more
economical in terms of cost of test materials, this method is not
acceptable for several reasons:

1) Many of the forms of objective items to be included are
too complex to be easily followed by ear;

2) Examinees cannot return to previously given items, thus
encouraging guessing rather than thinking through each
item carefully;

3) A structured time limit per item is imposed on the testing
situation. (As will be described later in this section,
every effort should be made to keep the testing untimed.)

Selecting the best method for reproducing test materials (i.e.,
item booklets and answer forms) should take into account such
factors as the types of facilities readily available to the
test constructor, the number of copies required, the types of
items to be reproduced (e.g., simple worded items vs. drawings,
pictures, complex diagrams), and funds available for such use.
(A detailed discussion of available document reproduction
techniques is well beyond the scope of this manual; the reader is
advised to discuss his particular requirements with those who
specialize in document reproduction.) In most cases, the reader
can employ the almost universally available and simple-to-use
techniques of mimeographing and photocopying when duplicating
test materials.

The Test Item Booklet 

A variety of methods has been recommended for the layout of
test items, usually suggesting the grouping of all items by
item format (e.g., multiple choice, true/false) and arranging
all items by increasing level of difficulty. However, several
factors, specific to this proposed instrument, require a
different item layout than is usually the case. For this
instrument, a satisfactory arrangement of items must include the
following considerations:

1) This instrument will contain separate subsets, each
composed of items of varying format.

2) The testing will be administered without time limits.

3) The test items will be administered twice to the same
examinees (and possibly again to future trainees).

Given the above factors specific to this test instrument, to- 
gether with a number of factors related to achievement testing 
in general, the following rules and guidelines should be 
employed in the construction of the test item booklet: 

1) A separate response sheet for recording answers should
be employed; no marking should be done in the test
booklet. It is then only necessary to reproduce new
answer sheets, using the same item booklet when
re-administering the instrument. In addition, hand scoring
and key punching will be facilitated when all item
responses are displayed serially on one answer sheet.

2) A complete set of directions, with sample items correctly
answered, covering procedures for answering items with
different formats should be displayed on the first pages
of the test booklet. As stated earlier, each item subset
may contain several item formats. Covering all types
of items at the outset makes it unnecessary to repeat
directions each time a different item format is
confronted. Note: The only exception to this rule is the
Interpretive Exercise Item (see p. 27), which may require
its own unique set of directions which should accompany
the item. (Structure and content of directions are
discussed on pp. 42-43.)

3) Within each subset, items can be grouped by format or
   by levels of difficulty, or both. Either way is equally
   acceptable since the conditions requiring one or the
   other of the methods do not apply here; that is,
   grouping by format is required when separate directions for
   each type are given. Also, since the test is untimed
   and will be given to adult examinees, it will not be
   necessary to arrange items entirely according to
   increasing difficulty. (The major assumption underlying
   arrangement by difficulty is that, in timed tests,
   an examinee who does not have enough time to finish the
   test will not have attempted those items he probably
   would not have answered correctly had he reached them.)

4) The entire item (i.e., the problem statement plus the
   response alternatives) should be placed on the same
   page, or on facing pages. Furthermore, if tables, graphs,
   diagrams, etc., are presented, they should be placed on
   the same page as all items referring to them, or on a
   facing page. This is probably the most important rule
   related to the arrangement of items. Having to turn
   pages back and forth to obtain a complete idea as
   presented in an item can be confusing to an examinee,
   especially when reference is made to graphs, tables,
   etc. Since it is not absolutely necessary to arrange
   items in any particular order within item sets, items
   can be rearranged on the various pages so that this
   rule is not violated.

5) When arranging items, correct responses to successive
   items should follow no pattern. When a tentative
   ordering of items is completed, the corresponding correct
   answers should be checked to determine if a repetitive
   pattern of response positions has resulted (e.g., a,b,
   a,b,c,a,b,a,b,c, etc.). When this occurs, the pattern
   should be altered, either by rearranging the order of
   alternatives of several items so that the correct answer
   appears in a different position among the possible
   alternatives, or by changing the order of several items.



6) The test title, introductory section (containing a
   statement of test purpose) and instructions to examinees
   should make up the first two pages of the test booklet
   with the first set of items following immediately.
   Although usually required in timed tests, a separate cover
   and Instruction and Item Section separator pages are not
   required for this instrument. They serve no useful
   purpose and only add unnecessary materials and
   reproduction cost. (See p. 45 for a discussion of the
   rationale for employing a detachable
   Introduction/Instruction Section when designing the test
   booklet for Pre- and Post-Test use.)

7) Labeling item sets and individual items. Each item set
   is considered a separate subtest (to be subjected to
   independent analysis) and should be labeled as such.
   Each subset should be preceded by a heading (e.g.,
   Item Set 1, Item Set 2, etc.) and the items within each
   subset numbered in succession beginning with Number 1.
   Arrows can be placed between item sets to serve as
   separators and to show the progression of test items.

   Labeling also serves to aid the examinee in test taking.
   Placing the same headings, item numbers and arrows on the
   answer sheet will reduce the possibility of errors when
   transferring the letters corresponding to the answers
   chosen from the test booklet to the answer sheet.
   (This point is discussed further in the following
   discussion on recording answers.)
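
Rule 5 above (no pattern among correct response positions) lends itself to a mechanical check once a tentative answer key has been written down. The following Python sketch is offered only as an illustration and is not part of the procedure described in this manual: the function name `repeated_patterns`, its limits `max_period` and `min_repeats`, and the representation of the key as a string of letters are all assumptions made for the example.

```python
def repeated_patterns(key, max_period=5, min_repeats=2):
    """Find places where a block of correct-answer positions repeats back to back.

    key is a string such as "ababcababc" (one letter per item, in test order).
    Returns a list of (start_index, block, repeats) tuples; an empty list
    means no back-to-back repetition was found within the given limits.
    """
    found = []
    n = len(key)
    for period in range(1, max_period + 1):
        # Assumption: a single letter must occur three times running to count.
        needed = max(min_repeats, 3) if period == 1 else min_repeats
        i = 0
        while i + period * needed <= n:
            block = key[i:i + period]
            repeats = 1
            while key[i + repeats * period:i + (repeats + 1) * period] == block:
                repeats += 1
            if repeats >= needed:
                found.append((i, block, repeats))
                i += repeats * period  # skip past the stretch just reported
            else:
                i += 1
    return found


# The pattern quoted in rule 5 (a,b,a,b,c,a,b,a,b,c) is reported, among
# other things, as a five-item block occurring twice in a row.
print(repeated_patterns("ababcababc"))
```

Any block reported by such a check is only a candidate for review; the test designer still decides whether to reorder the alternatives or the items.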



The Answer Sheet

Employing a separate answer sheet greatly benefits the test
administrators by reducing costs and facilitating scoring and
key punching. That using separate answer sheets does not
complicate the examinees' test-taking behavior is evidenced
by the fact that many widely employed achievement tests use
separate answer sheets with little or no difficulty, even at
the primary school level.

There are two general types of answer sheets in current use.
One is adapted for hand scoring, using a type of overlay
stencil with holes punched corresponding to the correct
responses. The other type is for machine scoring and is
specially designed to be read by a test-scoring machine or
optical scanner. There is no advantage in using machine
scored sheets unless: a) machine scoring facilities are
readily available; b) the machines can accommodate individual
subtest scoring; and c) the number of tests to be processed
warrants the expense of machine scoring. A more technical
discussion of this mode of scoring is beyond the scope of this
Manual. The reader considering machine scoring should discuss
his specific needs with the suppliers of such services when he
is planning the layout of items.






The design of the answer sheet to be hand scored or coded for
key punching must be simple so that it provides an unambiguous
task both for the examinee and the scorer. Systematic errors
either from incorrect recording of responses on the answer
sheet or from inaccuracies in scoring will confound the analysis
based upon resulting scores.

The layout of the answer sheet can be based on a variety of
formats. The most appropriate format is one which takes into
account the test booklet design as well as some general rules
of test-taking.

1. The general layout of the answer sheet should correspond
   to the item format in the test booklet. Item subgroups
   should be labeled by set number; items within each set
   should be numbered consecutively, beginning with number
   1, and there should be separations between subsets with
   indicators (arrows) showing the correct progression of
   items. The labels and arrows, appearing as they do on
   both the test booklet and answer sheet, reduce the
   possibility of error in marking answers. Also, the item
   responses are serially displayed on the answer sheet so
   that simple hand scoring can be easily and quickly
   carried out. A sample answer sheet, designed for a 75
   item test (with 5 item sets), is shown in Figure 2.

2. Complete instructions for recording answers should be
   stated on both the test booklet and answer sheet; sample
   items should also be provided in the booklet with space
   for responding to them set aside on the answer sheet.
   This will allow the examinee (before beginning the
   actual test) to become familiar with the item formats
   and with the procedure for recording answers on the
   separate sheet. (Sample items will not be scored.)

3. Examinees should be instructed to cross out (i.e., place
   an "X" through) the letter corresponding to the answer
   selected for each item. An "X" will readily
   appear through the holes of the scoring stencil while
   "circles" around the letter may not, resulting in
   failure to credit a correctly answered item.
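
The layout rules above (set headings, numbering restarted at 1 within each set, a sample-item section, and a score summary) can be illustrated with a short Python sketch that prints a plain-text answer sheet. It is a minimal sketch under stated assumptions, not a prescribed layout: the function name `build_answer_sheet`, its parameters, and the single-column arrangement are hypothetical, and the arrows between sets are omitted.

```python
def build_answer_sheet(items_per_set, choices="abcd", n_samples=2):
    """Return the lines of a plain-text answer sheet.

    items_per_set lists the number of items in each set, e.g. [15] * 5
    for a 75 item test with 5 item sets.
    """
    options = " ".join(choices)
    lines = ["TRAINEE ID ________   DATE ________", "", "Sample Items"]
    lines += [f"{n:>2}. {options}" for n in range(1, n_samples + 1)]
    for set_number, n_items in enumerate(items_per_set, start=1):
        lines += ["", f"SET {set_number}"]
        # Numbering restarts at 1 within every set, as in the booklet.
        lines += [f"{n:>2}. {options}" for n in range(1, n_items + 1)]
    lines += ["", "SCORE SUMMARY"]
    lines += [f"SET {s} ____" for s in range(1, len(items_per_set) + 1)]
    lines.append("TOTAL SCORE ____")
    return lines


print("\n".join(build_answer_sheet([15] * 5)))
```

Generating the sheet from the list of set sizes keeps the headings and item numbers on the answer sheet in step with the booklet when the number of items changes.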







FIGURE 2

MODEL ANSWER SHEET

TRAINEE ID
DATE

Instructions: Place an "X" through the correct, or most correct,
answer alongside the same item number as in the booklet.

Sample Items
 1. a b c d
 2. a b c d

SET 1          SET 2          SET 3          SET 4          SET 5
 1. a b c d     1. a b c d     1. a b c d     1. a b c d     1. a b c d
 2. a b c d     2. a b c d     2. a b c d     2. a b c d     2. a b c d
 3. a b c d     3. a b c d     3. a b c d     3. a b c d     3. a b c d
 4. a b c d     4. a b c d     4. a b c d     4. a b c d     4. a b c d
 5. a b c d     5. a b c d     5. a b c d     5. a b c d     5. a b c d
 6. a b c d     6. a b c d     6. a b c d     6. a b c d     6. a b c d
 7. a b c d     7. a b c d     7. a b c d     7. a b c d     7. a b c d
 8. a b c d     8. a b c d     8. a b c d     8. a b c d     8. a b c d
 9. a b c d     9. a b c d     9. a b c d     9. a b c d     9. a b c d
10. a b c d    10. a b c d    10. a b c d    10. a b c d    10. a b c d
11. a b c d    11. a b c d    11. a b c d    11. a b c d    11. a b c d
12. a b c d    12. a b c d    12. a b c d    12. a b c d    12. a b c d
13. a b c d    13. a b c d    13. a b c d    13. a b c d    13. a b c d
14. a b c d    14. a b c d    14. a b c d    14. a b c d    14. a b c d
15. a b c d    15. a b c d    15. a b c d    15. a b c d    15. a b c d

                              - END -

SCORE SUMMARY
SET 1 ____
SET 2 ____
SET 3 ____
SET 4 ____
SET 5 ____
TOTAL SCORE ____





ADMINISTERING THE INSTRUMENT 



The achievement measures derived from the test instrument
described here are based on responses to a set of structured
stimuli. The test items themselves are only one component of
the total stimulus situation, the other integral part being
the test instructions. The interaction of these two components
gives the test situation its structure and provides a
standardized basis for measurement. That is, what the completed
instrument measures is not only a function of the types of items
employed but also a function of the examinee's conception of
the test's purpose and what he is required to do. Since test
instructions and administration procedures have an influence
upon the measurement obtained, their formulation should be an
integral part of the planning and development of the instrument.
Their development should parallel item construction and the
design of the physical layout.

The testing situation must be structured so as to reduce the
influence on the examinee's test performance of factors other
than those related directly to training outcome.



Extraneous Factors 

In order to underscore the importance of the guidelines
discussed in this section, it is necessary to review those
extraneous factors that relate to the mechanics of test-taking
and the testing situation.

1) The purpose of the test, depending upon how it is
   interpreted by the examinees, can positively or negatively
   influence test performance. The way in which the test
   is presented will have direct effects on such factors
   as attitude toward the test, test-taking motivation,
   and arousal of test anxiety.

2) Examinees in any structured testing situation vary widely
   with respect to prior experience with objective-type
   tests. Differential sophistication in objective test
   taking, unless controlled for, will be reflected in the
   variation among subsequent scores.

3) How the test constructor deals with the problem of
   "response-guessing" will influence test performance.
   Imposing penalties for guessing can introduce extraneous
   score variability due to the influence of non-cognitive
   personality factors. For example, when penalties are
   assessed for wrong answers ("assumed guessing"), it is
   likely that a part of the variation among scores will
   be due to individual differences in risk-taking
   behavior.

4) Time constraints can have an effect on measurement
   outcome. There are subject areas in which speed of
   response is an integral component of achievement (e.g.,
   measuring proficiency in such areas as typing,
   shorthand, telegraphy, airplane navigation, etc.). However,
   the objective of this proposed instrument is to measure
   both the range of subject information acquired during
   training and the ability to adapt and apply this
   learning to new situations, and not to measure how fast
   examinees can respond to items correctly. In addition,
   timed testing conditions introduce the risk of operation
   of personality factors (e.g., risk-taking behavior, test
   anxiety, etc.) which, although not directly correlated
   with what is being measured, could affect the
   measurement outcome. The test should be administered as a
   "Work-Limit Test" of the type described by Ebel (9),
   the objective of which is to determine how much the
   examinee can do, regardless of how fast or how slowly
   he works.

The above factors are of primary concern in any testing
situation where objective-type items are employed. The need for
such controls is increased when measuring a learning
outcome under a Test/Retest design. Every attempt must be made
to keep the structure of the testing situation as consistent
as possible from the Test to Retest. The following guidelines
should be employed by the test constructor at the appropriate
stages.



Test Instructions 

In order to ensure uniformity of task orientation for every
examinee, it is necessary to observe the following rule:
regardless of any references made to the nature of the test
prior to actual testing, make the assumption that the examinees
know nothing about the nature of the test or the mechanics of
test-taking. The test instructions (in written form) should
structure the testing environment. Test performance must
reflect only the behavior the instrument was designed to measure,
not the examinee's ability to decode instructions.

Test administration procedures should provide a testing
environment that is uniform for all examinees. A number of points
should be covered in the introductory period preceding the
test-taking. These discussion points can be grouped under
several topic areas.



1. Test Introduction

   a. The Pre-Test

      1) In general, examinees should be informed that
         they will be given a test that will provide the
         training staff with information for use in (1)
         adjusting the level of instruction to meet the
         needs and demands of the current trainee group
         (see pp. 61-64; Utility of the Pre-Test) and
         (2) defining program strengths and weaknesses
         when revising the program for a future presentation.
         They should be informed that another test, to
         be given at the end of the course, will be an
         integral part of this evaluation. (They should
         not, for obvious reasons, be told that the two
         tests will be the same.)

      2) While the major emphasis should be on the
         importance of the test in helping to improve
         training, trainees should also be informed that
         the information obtained will help determine
         how much they have individually learned as a
         result of the sequence of instruction.

   b. The Post-Test

      The reintroduction of test purpose should, for the
      most part, be exactly the same in structure and
      content as its pre-training counterpart. The
      exception is part (a), which should be revised so
      as to reintroduce the test as the second part of
      the evaluation procedure that was initiated at
      the beginning of the course.

Note: The test description and instructions should be
printed on the first pages of the test booklet. This
is to ensure that all the introductory material developed
by the test designer is presented in a standardized
manner on both administrations of the test.




2. Procedures For Responding and Recording Responses



   a. The method for selecting the correct response should
      be printed on the first page of the item booklet.*
      (This is necessary since the instrument is made up
      of a number of subtests, each of which will contain
      items varying in format.) In addition, the
      instructions should be followed by examples of the types of
      items covered, together with the correct answers.

   b. Trainees should be requested to read the directions
      silently while they are being read aloud by the
      examiner. (This is suggested since examinees do
      not always read introductory material carefully and
      understanding such material is necessary for
      establishing the optimal "mental set" for test-taking.)

   c. The examiner should demonstrate how to record
      response choices on the separate answer sheet. A
      separate section for recording answers to sample
      items (provided in the instructions on the test
      booklet) should be printed on the answer sheet.

   d. Trainees should then be instructed to answer the
      sample items, recording their choices on the
      separate sheet. The sample items should then be
      covered by the examiner to correct any errors in
      the mechanics of recording.

Note: It is important, so that the test items function
with maximum effectiveness, that the examinees understand
the instructions completely before proceeding beyond
this point.

3. Instructions On Time Limits

   A printed statement should reflect the fact that the
   testing session will not be timed and that all examinees
   will have adequate time to attempt all items. They
   should be instructed, however, not to spend more than
   (X)** minutes on average per item, and to go through
   the entire test once before returning to any unanswered
   items.

*  Except for directions that should precede the more unique,
   complex items (e.g., interpretive exercises).
** It is the task of the item writer to determine the time
   required to answer the average item, depending upon such
   factors as item difficulty, total number of test items, etc.



4. Directions For Guessing

   Instructions should state that scores will be
   calculated as the number of items answered correctly,
   with no penalty for guessing. Examinees should be
   strongly encouraged (since unanswered items will be
   counted as incorrect) to respond to every item
   regardless of whether or not they are completely sure
   of the correct answer. (They might also be told
   that each response given is an important bit of
   information which will work to facilitate the
   evaluation being conducted.)
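
The scoring rule stated in the Directions For Guessing (number right, nothing subtracted for wrong answers, unanswered items counted as incorrect) can be made concrete with a small Python sketch. The function name `number_right_score`, the use of `None` for an unanswered item, and the four-item key are assumptions introduced only for this illustration.

```python
def number_right_score(responses, key):
    """Return the number of items answered correctly.

    responses and key are equal-length sequences of letters; an unanswered
    item is represented by None and earns no credit, exactly like a wrong
    answer. Nothing is subtracted for wrong answers.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must cover the same items")
    return sum(1 for given, correct in zip(responses, key) if given == correct)


key = ["b", "c", "a", "d"]             # hypothetical four-item key
left_blank = ["b", None, "a", None]    # two items left unanswered
guessed = ["b", "c", "a", "a"]         # same examinee, guessing instead

print(number_right_score(left_blank, key))  # 2
print(number_right_score(guessed, key))     # 3
```

Under this rule a guess can never lower a score, which is why examinees are encouraged to respond to every item.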



While formulation of the exact wording is the responsibility
of the test designer(s), the four factors described above
should be included in simple and unambiguous language to
ensure that test-taking is as uniform as possible for all
examinees.

A model set of instructions (with Pre-Test Introduction),
incorporating the guidelines covered in this chapter, is
illustrated in Appendix D. The model is flexible enough so that
the reader can employ the same format when developing his own
test instrument. (The model is complete for the Pre-Test; it
can be adapted for the Post-Test by making the recommended
changes in the introductory section.)

Note: The introductory material will vary from Test to Retest
while the same group of items is administered each time.
Therefore, in order to minimize the cost of duplicating
test materials, the test booklet should be designed so
that the Pre-Test and, later, the Post-Test Introduction
section can be affixed to the same set of test items.
That is, after initial testing the Pre-Test Introduction
section can be removed from the test booklet and
replaced with its Post-Test counterpart.



Guidelines For Administering The Instrument 

The timing of test administration (specifically of the
Pre-Test) can have an unwanted influence on the examinee's
performance. While the Post-Test should definitely be administered
during one of the last formal training sessions, the time of
initial testing should be based, in part, on the nature of the
training group.

1. In a situation where the training program brings
   individuals from different backgrounds together as a
   formal group for the first time, it is best to administer
   the initial test several days after formal training sessions
   have begun. This will allow the participants to become
   acclimated to the training environment. Since they will have
   totally adapted to the training situation at the time of
   retesting, providing a period of adjustment prior to initial
   testing will help equalize the conditions underlying the
   two administrations of the instrument. When testing is
   scheduled in this way, it will be necessary for the test
   designer to exclude items assessing subject material covered
   in the sessions before the Pre-Test. The added control over
   testing conditions should more than compensate for the
   reduction in total course content coverage.

   In situations where participants have had time to
   familiarize themselves with the training environment prior to the
   first formal sessions (e.g., when training courses are
   preceded by an orientation program), the Pre-Test can be
   administered during the first formal training session.

2. The introductory format (except for the minor changes
   suggested on page 43) should be exactly the same for both
   administrations of the test. Statements of purpose and
   instructions should be repeated orally by the examiner on
   the Post-Test exactly as they were for the Pre-Test. A
   good rule to follow is: treat the administration of the
   Post-Test as if it were being given to the examinees for
   the first time.



3. The same examiner should administer both the Pre-Test 
and the Post-Test whenever possible. 



4. When the test examiner is someone other than the test
   designer, he should become thoroughly familiar with the
   test materials prior to the first testing session. He
   should study the test booklet, instructions and answer
   sheet thoroughly so that he can administer the test in
   exactly the way it was conceived by the test designer
   and be capable of answering questions and dealing with
   problems should they arise during the course of testing.
   It should be obvious that examinees are placed at a
   disadvantage when the examiner is not familiar with the test
   materials.






5. The examiner should respond to all questions related
   to test-taking mechanics. He is cautioned, however, not
   to attempt to answer questions relating to individual test
   items.



6. The examiner should scan each answer sheet as it is
   turned in and check that each has proper examinee
   identification. The final administrative task for the examiner is
   to make sure that all test materials have been returned,
   especially the test item booklet.

7. At the time of pre-testing it is important not to
   reveal the fact that the same test will be readministered
   at the end of the training course. If requests for copies
   of test items and/or answers are made at the end of the
   testing period, the examiner should make a statement to
   the effect that items and answers cannot be distributed to
   examinees since they are a standard part of the training
   program and their effectiveness in future use requires
   that they remain confidential. At the same time, examinees
   should be assured (when it is feasible to do so) that the
   results of the evaluation will be made available to them
   some time after the course has been completed.



CHAPTER III
CODING AND PREPARATION OF RESPONSE
DATA FOR STATISTICAL ANALYSIS

This chapter discusses procedures for preparing the data on
the answer sheet for hand scoring (using scoring stencils) and
for computer-processed scoring (using punched card input).

The reader with ready access to machine scoring facilities
will require special answer sheets designed for that purpose.
This section is therefore optional for those intending to
utilize machine scoring since the preparation of answer sheets
can be left to the personnel in charge of such scoring services.

MANUAL/MECHANICAL PROCESSING

The Scoring Stencil 

The most reliable method for hand scoring separate answer
sheets is to employ a scoring stencil, a simple answer sheet
overlay with holes punched corresponding to the correct answers.
The stencil, which can be easily constructed, reduces scoring
to a standardized, mechanical procedure that is less subject
to errors than are the more "free-style" hand-scoring methods.

1. An easy-to-design and inexpensive stencil can be
   constructed by reproducing a copy of the answer sheet and
   punching out spaces corresponding to the position of the
   correct answers. A sample punched stencil for use with
   the answer sheet (provided on p. 40) is shown in Figure 3.
   (The dark circles correspond to punched-out areas,
   designating answer positions; the dark rectangle when cut out
   will reveal the score summary section on the answer sheet.)



2. An alternative to this type of stencil, which is punched
   according to a specific set of pre-selected answers, is
   a more flexible, reusable type of scoring overlay, similar
   to the one shown in Figure 4, which was also designed for
   the sample answer form. (The darkened areas correspond
   to punched-out windows.) This type is very easy to use.

   a) Letters indicating correct answers are written in the
      spaces, between the parentheses, which correspond to
      their respective item numbers within item subsets.
      (The letters when pencilled in can be erased and the
      stencil used again, provided that the same number of
      items is used with the same format.)



FIGURE 3

MODEL SCORING STENCIL I

TRAINEE ID                          FP/MCH Program 2
DATE                                4/74-7/74

Instructions: Place an "X" through the correct, or most correct,
answer alongside the same item number as in the booklet.

Sample Items
 1. a b c d
 2. a b c d

SET 1          SET 2          SET 3          SET 4          SET 5
 1. a b         1. a b ● d     1. a b c ●     1. ● b c d     1. a b c ●
 2. a ● c d     2. a b c ●     2. a b ● d     2. ● b c d     2. ● b c d
 3. a b         3. a b ● d     3. a b c ●     3. a ● c d     3. a ● c d
 4. ● b c d     4. a ● c d     4. a ● c d     4. a b c ●     4. a ● c d
 5. a b ● d     5. ● b c d     5. a b c ●     5. a b c ●     5. a ● c d
 6. a b ● d     6. a ● c d     6. ● b c d     6. a b ● d     6. a b c ●
 7. a ● c d     7. a b ● d     7. a ● c d     7. a b c ●     7. a b ● d
 8. a b c ●     8. a b ● d     8. a b ● d     8. a b c ●     8. a ● c d
 9. a b ● d     9. ● b c d     9. a ● c d     9. ● b c d     9. ● b c d
10. a b ● d    10. a b c ●    10. a ● c d    10. a b ● d    10. a ● c d
11. a ● c d    11. a b c ●    11. ● b c d    11. a ● c d    11. a b ● d
12. ● b c d    12. ● b c d    12. a ● c d    12. a b c ●    12. a ● c d
13. ● b c d    13. a b ● d    13. a b ● d    13. a ● c d    13. a b ● d
14. ● b c d    14. a ● c d    14. a b c ●    14. a ● c d    14. a b ● d
15. a ● c d    15. ● b c d    15. a b ● d    15. a ● c d    15. ● b c d

                              - END -



FIGURE 4

MODEL SCORING STENCIL II

FP/MCH Program 2
4/74-7/74

[Reusable scoring overlay for the sample answer form. Within each
item set, every item number is followed by blank parentheses,
e.g. 1.(_) through 15.(_), beside a punched-out window.]



   b) The stencil is placed over the answer sheet so that
      subset numbers from the two forms line up and the
      item choices on the answer sheet appear next to their
      item numbers on the stencil.

   c) Subset and composite scores are then calculated and
      transferred to the score summary column provided on
      the answer sheet. These scores will subsequently be
      transferred to a profile sheet which will display the
      scores for all examinees.

Two sets of data will provide the input for the statistical
analysis. The first set is made up of the responses of each
examinee to all the items on both administrations of the test.
These data, in the form of an Item/Examinee data matrix, are
obtained from the individual answer sheets.

The second data set consists of the Test and Retest scores,
both composite and item subsets. To facilitate the
non-computer analysis of score data, some type of score summary or
profile form should be employed. This is simply a large sheet
containing the score distributions for each examinee (identified
by an ID code). Having all scores for all examinees displayed
on such a form eliminates the necessity of having to manipulate
a large number of answer sheets (i.e., 2 x the number of
examinees) when analyzing score data, a situation conducive to
quantitative error and unnecessary expenditure of time and
labor.

Score profile sheets are not difficult to construct and can be
based upon any number of formats. An example of one type of
profile, with sample test scores, is shown in Figure 5. The
value of using a score profile such as the one in this example
will become apparent in a later discussion on non-computer
analysis of score data. It should be noted that the value of
such a device is reduced if errors occur in transferring scores
from the answer sheets to the profile. Whenever data are being
transferred from one form to another, the forms should be
checked by another individual to ensure reliability of the data.


Processing For Hand Scoring

1. The first step in processing is to assign identification
   codes to the response sheets. The response sheet for each
   examinee should have the following basic information:



FIGURE 5

SCORE PROFILE WITH TEST/RETEST SCORES
FOR 31 EXAMINEES ON A 113 ITEM TEST

FP/MCH Program 1 (11/73-12/73)

EXAMINEE  ITEM SET 1   ITEM SET 2   ITEM SET 3    COMPOSITE
   ID       T1    R1     T2    R2     T3    R3     CT    CR
   01       28    34     23    32      9    12     60    78
   02       30    42     23    29      4     6     57    77
   03       13    19     14    17      3     2     30    38
   04       26    43     26    36      9     9     60    88
   05       23    26     29    33      5    11     57    80
   06       29    37     30    36      7     7     60    80
   07       33    33     25    34     --    12     65    79
   08       23    31     21    28      6     6     50    65
   09       28    32     22    24      9     8     59    64
   10       29    34     24    30     11    15     64    --
   11       25    34     21    28      6     6     52    68
   12       13    29     28    29      4     9     45    67
   13       18    35     27    30     10     9     55    74
   14       22    35     21    30      6     9     49    --
   15       24    36     26    29      6     8     56    73
   16       --    36     19    30      6    10     41    76
   17       31    41     28    39      8    10     67    90
   18       20    36     25    30      4     9     49    75
   19       19    36     23    32      6    12     48    80
   20       21    41     23    31      7     6     --    78
   21       29    37     24    33      9    13     --    83
   22        8    31     14    29      6     8     28    68
   23       27    35     16    33      7    --     50    77
   24       29    36     25    32      7     9     61    77
   25       28    38     27    30     10    --     65    80
   26       23    38     25    33      8    12     56    83
   27       27    29     28    30     --     8     62    --
   28       25    39     29    38      6    12     65    89
   29       18    38     24    28      6     8     48    74
   30       --    35     23    33      5    --     50    75
   31       26    31     28    33      5    --     59    73






   a. An exclusive, two-digit (or more) ID number to identify
      the trainee.

   b. A code to designate whether the responses are from the
      Pre-Test or Post-Test (e.g., 1 for Pre- and 2
      for Post-Test).

   c. The date of administration, to identify the specific
      course attended by the trainee. (If more than one
      course is being conducted during the same time period,
      each course should be assigned a code number and the
      appropriate one written on the sheet.)

   In certain situations it may be necessary to add further
   identification information to the response sheets. For
   example, the training staff may wish to evaluate the
   effectiveness of instruction on various training subgroups
   (e.g., professional vs. paraprofessional health workers).
   In this case, a code which classifies the trainee into
   one or another subgroup would also be entered on the
   response sheet.

2. Each answer sheet should be visually scanned to detect any
   items with more than one answer marked. This is necessary,
   prior to scoring with some stencils, since only the correct
   answer will appear through the punched hole. Multiple-
   answered items are to be regarded as incorrect and should
   be eliminated from scoring. A colored line drawn across
   the response choices for those items will show through the
   stencil, indicating to the scorer that the item is to be
   discounted.




HAND SCORING PROCEDURES 

General Scoring; Item Set and Composite Scores 

When deriving the item set and composite scores, the hand
tabulating method used should result in error-free scores with
a minimum amount of staff time and effort.

The employment of a scoring stencil and separate answer sheets
has been suggested to facilitate scoring with reduced errors.
In a further effort to reduce the chances of error and confusion
when processing individual item response protocols, a
systematic procedure should be used in tabulating and recording
scores. The approach recommended is as follows:

1. Tabulate each item set score separately and record the
score in the place provided on the answer sheet (see
Figure 2, p. 40).

2. Sum all item set scores to derive the total (composite)
score and record it on the answer sheet.

3. When all scores have been tabulated and checked, transfer
each one to its appropriate space on a score profile
form.

4. Repeat steps 1-3 for each trainee's answer sheet.






The Test score data that are to be used (along with the Retest
scores) in the sample analysis runs illustrated in
Chapter VII are displayed on the Score Profile Form in Figure 5.
With scores displayed in this manner, summary statistics
(i.e., means and standard deviations) can be easily computed
for the Pre-Test scores (and later for the Post-Test scores).
These statistics can be useful when discussing the trainees'
test performance, since they summarize the masses of score
data and provide a concise description of overall pre-training
(and post-training) levels of subject matter competence.
The importance of carrying out separate analyses of Pre- and
Post-Test data, immediately after their respective administrations,
in addition to the combined Pre-/Post-Test data analysis,
will be discussed further in subsequent sections of the
Manual.
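
The tabulation steps and summary statistics described above can
be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the trainee IDs and
scores are invented (they are not the Figure 5 data), the
function names are hypothetical, and the standard deviation
uses the n - 1 denominator, which the Manual does not specify
at this point.

```python
from statistics import mean, stdev

# Hypothetical item set scores for three trainees (not from Figure 5).
profiles = {
    "01": [12, 20, 15],
    "02": [10, 18, 17],
    "03": [14, 22, 13],
}

def composite(item_set_scores):
    """Total (composite) score: the sum of a trainee's item set scores."""
    return sum(item_set_scores)

def summarize(scores):
    """Mean and sample standard deviation (n - 1 denominator, an assumption)."""
    return mean(scores), stdev(scores)

# Composite score for each trainee, in profile order.
composites = [composite(s) for s in profiles.values()]
comp_mean, comp_sd = summarize(composites)

# One (mean, sd) pair per item set, taken column-wise across trainees.
set_stats = [summarize(col) for col in zip(*profiles.values())]
```

Running the same two functions on the Post-Test profile would
give the matching post-training summary.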

Scoring by Trainee Subgroup; Item Set and Composite Scores

It should be noted that on the Score Profile (Figure 5), the
31 trainees were considered as a single group when the item
set, composite scores and summary statistics were computed.
An alternative approach is to conduct the score analysis on
subgroups of trainees, determined on the basis of one or more
relevant parameters. Some factors that could be used to
define subgroups are the results of Pre-Test scores (high,
medium, and low scoring subgroups), or the educational/experiential
backgrounds of trainees (e.g., professional vs.
paraprofessional groupings). The purpose of such groupings would
be to determine whether the training has greater or lesser
effectiveness with one group than with others. That is, does
a specific factor such as initial competence or professional
background have some influence on the degree to which the
training instruction is successful in meeting its objective
of increasing subject competence? The decision to subdivide
trainees and the selection of factors that would underlie any
subdivisions will be made by the training administrators
according to training objectives and composition of the group.






If the decision is made to put trainees in subgroups, the
end-of-instruction Pre-/Post-Test analysis of item set and
composite score distributions should be conducted separately for
each subgroup so that inter-group comparisons can be made.
In such cases, it is recommended that separate profile sheets
be used for each trainee subgroup, using the same format as
the Profile in Figure 5, and employing distinct ID codes to
identify the various subgroup profiles.



COMPUTER PROCESSING
Editing

1. Same as processing step 1 for hand scoring (p. 53).

2. Items with more than one answer marked are to be
eliminated from scoring by drawing a line across the
item on the answer sheet. (When punching, the card
columns corresponding to these items will remain
blank.*)

3. All item responses should be number coded and transferred
to 80-column key punch coding forms. (See
Figure 6 for a sample coding form.) Punching should
be done from these coding forms and not directly from
the answer sheets, to reduce the possibility of error.

Coding

Systematic coding procedures must be employed since considerable
data manipulation is involved, presenting numerous
possibilities for error. The following sequence outlines the
procedures required to convert the raw data (individual item
responses) to coded data for use in punch card processing.

The format described is required for data that is to be
computer-analyzed under the specific system of scoring and
analysis programs presented in the Manual. (See Appendix E.)

* This applies unless a standard Item Analysis of the response
data is to be conducted (see pp. 121-127). Most Item Analysis
procedures require a code for differentiating between the types
of possible responses given to an item (i.e., correct,
incorrect, multiple response and no response). A
suggested code is provided on page 56.

1. a) When the response alternatives are listed alphabetically
on the answer sheet, it is necessary to convert
responses to their numeric counterparts (a=1, b=2,
etc.). The converted numbers designating an answer to
an item should be written in to the right of that item.
(Note: For Item Analysis purposes (see pp. 121-127),
no response given should be coded "0" and multiple
responses, which are to be considered incorrect,
should be coded "9".)

b) When the conversions have been completed for all answer
sheets, each of the sheets should be checked for conversion
errors. If possible, checking should be done
by a second person.

2. The response choice numbers should then be transferred to
80-column coding forms. It is assumed that those planning
to use key punch and computer facilities will have
ready access to standard coding forms. If this is not
the case, a simple form can be constructed using the
sample form displayed in Figure 6 as a model.
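
The letter-to-number conversion in step 1 can be illustrated
with a short Python sketch. It is not part of the Manual's
FORTRAN IV package; the function names are hypothetical, and
the sketch assumes no item has more than eight response
alternatives, so that no letter collides with the "9" code
reserved for multiple responses.

```python
NO_RESPONSE = 0   # code for an unanswered item
MULTIPLE = 9      # code for an item with more than one answer marked

def code_response(marked):
    """Convert the letters marked for one item to a single numeric code.

    `marked` is a string of the letters marked, e.g. "b", "" or "ac".
    """
    letters = marked.strip().lower()
    if not letters:
        return NO_RESPONSE
    if len(letters) > 1:
        return MULTIPLE
    code = ord(letters) - ord("a") + 1   # a=1, b=2, ...
    if not 1 <= code <= 8:
        # Assumption: at most eight alternatives (a-h) per item.
        raise ValueError(f"unexpected response letter: {marked!r}")
    return code

def code_sheet(answers):
    """Code every item on one answer sheet, in item order."""
    return [code_response(a) for a in answers]

# Hypothetical answer sheet: items 1-5.
coded = code_sheet(["a", "c", "", "bd", "E"])
```

Having a second person run the same conversion independently
and comparing the two coded lists is one way to carry out the
check recommended in step 1 b).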

Punch Card Data Format 

In order to run the data with the programs provided in Appendix
E, it is necessary to transfer the data to the coding forms
according to a specific punch card format. That is, data are
to be displayed on the coding forms so that after punching
there is a card (or cards, depending on the number of items)
for each trainee, containing his responses to all items.*
The first card for each trainee must contain an ID number.
When more than one card per trainee is required, these cards
should also contain ID numbers to identify each trainee's
cards should the data deck become disordered.

* A more complete description of the computer input data
formats will be found in the documentation for Program
COMSCOR in Appendix E.

The data transfer from answer sheet to coding form should proceed
according to the following punch card format (each
80-column row on the coding form equals one punch card):






Data card format (for each trainee):

Card 1

Columns 1-2  : trainee ID number (required).
Columns 3-80 : item responses. If total number of
               items exceeds 78, extra cards will
               be needed.

Cards 2-4

Columns 1-2  : trainee ID number.
Columns 3-80 : item responses.
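
As an illustration of this layout, the following Python sketch
builds the 80-column card images for one trainee. It is a
sketch only: the function name is hypothetical, and the choices
to zero-pad the ID and to leave unused columns blank are
assumptions; the authoritative input format is the one given in
the COMSCOR documentation in Appendix E.

```python
RESPONSES_PER_CARD = 78   # columns 3-80 hold the item responses

def card_images(trainee_id, responses):
    """Return the 80-column card images for one trainee.

    trainee_id: integer ID, 0-99 (columns 1-2, zero-padded here).
    responses:  list of single-digit response codes, in item order.
    """
    if not 0 <= trainee_id <= 99:
        raise ValueError("ID must fit in two columns")
    if any(not 0 <= r <= 9 for r in responses):
        raise ValueError("each response must be a single digit")
    digits = "".join(str(r) for r in responses)
    cards = []
    for start in range(0, len(digits), RESPONSES_PER_CARD):
        chunk = digits[start:start + RESPONSES_PER_CARD]
        # Every card repeats the ID; unused columns are left blank.
        cards.append(f"{trainee_id:02d}{chunk:<78}")
    return cards

# Hypothetical 113-item response record for trainee 3.
cards = card_images(3, [1, 2, 3, 4] * 28 + [9])
```

A 113-item test therefore needs two cards per trainee: 78
responses on the first and the remaining 35 on the second.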



After the last trainee's set of responses has been recorded on
the coding form, each set should be checked for possible
errors. (Again, it is suggested that cards be checked by a
different person, if possible.) A sample coding form displaying
the responses of four trainees to a 113-item test (from
the Rennes Assessment Study) is shown in Figure 6.



Keypunching

The only requirement for punching the response data is that a
keypunch comparable to the IBM 029, which employs an EBCDIC
code, be used. The FORTRAN IV programs presented in Appendix
E require that data decks be punched according to this code.
It is suggested that this requirement be discussed with available
computer personnel before the data is punched.

Since all of the numeric data for the complete statistical
analysis will be contained in the Pre-Test and Post-Test item
response decks, all appropriate procedures for data verification
should be employed during the punching stage.

In order to avoid confusion during the analysis stage, each
response deck should be designated as either Pre-Test or
Post-Test. Such identification can be written (with an ink marker)
either on the front and back data cards or on the top and
bottom edges of the card deck.



COMPUTER SCORING PROCEDURES




General Scoring; Item Set and Composite Scores 

The computer scoring of item responses by separate item sets
and composite scores will be carried out by Program COMSCOR,
using as input the punched card data decks, the preparation of
which was described above. The program will compute scores
and summary statistics for each item set as well as for the
item composite, treating the trainees as a group. A complete
description of the requirements for and capabilities of Program
COMSCOR is provided in the documentation for the program,
in Appendix E.

FIGURE 6
DATA CODING FORM

Health and Family Protection - 11/8/73
Pre-Test Item Responses (31 trainees x 113 items)
Page 1

(The entries on the coding form are not legible in this copy.)

Scoring By Trainee Subgroup; Item Set and Composite Scores

A variation in the computer scoring procedure will be necessary
if the trainees have been divided into subgroups, and
inter-group scores and item response analyses are required.
The input data deck would be divided into subdecks, each subdeck
corresponding to a subgroup of trainees, and serving as
data input for analysis of subgroups. For example, with high,
medium, and low score subgroups, the program (COMSCOR) would
be run three times, resulting in three separate item set and
composite score printouts. With this data available, it would
be possible to make intergroup comparisons to check for any
differential effects of instruction due to differences in
group characteristics. Separate subgroup runs for all other
computer programs comprising the analysis package would also
be carried out.



Timing of the Scoring; the Pre-Test

Although it is possible to defer the scoring of the responses 
to the Pre-Test until the end of the training sequence, run- 
ning both Pre- and Post-Test analyses together, it is recom- 
mended that the Pre-Test be scored as soon as possible after 
its administration. While a comprehensive discussion of this 
issue is presented in the next chapter, it should be noted at 
this point that the scoring data derived from the Pre-Test can 
be valuable to training administrators and instructors, by 
giving them information to help guide the course of instruc- 
tion, to make it more responsive to the needs and demands of 
the current trainee group. 






CHAPTER IV 
UTILITY OF THE PRE-TEST 



In terms of assessing the effect of training on subject matter
competence, the Pre-Test is used to provide pre-instruction
baseline levels of competence against which the post-training
levels are to be compared. However, the Pre-Test also has
independent utility as a means by which the proposed course
of instruction can be assessed and modified (if necessary) to
meet certain trainee needs and demands.

Before becoming involved in a discussion of the first stage
of data analysis, it is necessary to consider the three basic
assumptions which underlie the evaluative design.

The importance of establishing the validity of these assumptions
rests on the fact that the degree of confidence ascribed to any
inferences concerning the effectiveness of training is based
upon the degree to which the assumptions are considered
tenable.

Assumption 1. The test item content provides a representative
sampling of the subject content comprising the
course of instruction.

Assumption 2. All items were properly constructed and
critically reviewed (with the necessary revisions made) in
accordance with the rules and guidelines presented on pp. 28-31
and Appendix C.

Assumption 3. The Pre-Test scores of the trainees should
tend to be quite low across subject areas. That is, the
incoming trainees should not, as a group, be highly competent
in the subject material comprising the training.
There would be little value in constructing a training
program with students who were already competent in the
material to be covered. (The following section will discuss
situations in which this assumption does not hold.)



One way in which an analysis of the Pre-Test scores can be of
value is as an indicator of whether or not the course material
(as outlined in the curriculum plan and as represented by the
test item content) is appropriate (or adequate) for the
current group of trainees. Consider, for example, a situation
in which a majority of the trainees score very high (e.g., 80%+)
on one item set and do as expected (see assumption 3, above)
on the remainder of the test. Since the pre-instruction
competence in one of the subject areas to be covered is already
quite high, it would be a waste of time, both for the training
staff as well as the trainees, to present the course with
the same subject material as outlined in the original curriculum
plan. It would be necessary to effect some revisions in
the proposed curriculum to make the instruction more relevant
to the demands of the group. This can be easily done by
adopting one of the following procedures:

1. Consider the trainee group as competent in the high
score area and drop the subject area completely from
the curriculum, devoting all of the training time to
a coverage of the low-score areas. This might be
done in situations where the scores in one subject
area are so high that the instructors feel that little
more could be learned by the trainees in that subject
area.

2. Keep the subject area in the curriculum, but shift
emphasis away from the high score area toward a more
intensive coverage of the areas in which the trainees
displayed low pre-instruction competence.

3. Keep the subject area in the curriculum, but upgrade
the instruction to a more advanced (and possibly more
difficult) coverage of the subject area. That is, a
segment of instruction would be devoted to coverage of
subject material in an area that required an already
basic level of competence. (This is similar to the
practice of using the introductory course or an
advanced placement exam as the prerequisite to the
intermediate and/or advanced level courses in a
specific area.) In some subject areas, it might not
be possible to upgrade the level of instruction, so
that it would be better to drop the area from the
curriculum and concentrate on the other areas (see
option 1).
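
The situation that triggers these options, a majority of
trainees scoring very high on one item set, can be screened
for mechanically. The Python sketch below is illustrative
only: the function name, the item set labels and the scores
are invented, and the 80 percent cut-off simply follows the
"80%+" example in the text; the instructors would choose their
own figure.

```python
def high_score_sets(percent_scores, cutoff=80.0):
    """Return the item sets on which a majority of trainees reach `cutoff`.

    percent_scores maps an item set name to the list of trainees'
    percent-correct scores on that set.
    """
    flagged = []
    for name, scores in percent_scores.items():
        n_high = sum(1 for s in scores if s >= cutoff)
        if n_high > len(scores) / 2:   # strict majority of the group
            flagged.append(name)
    return flagged

# Hypothetical Pre-Test results for five trainees on three item sets.
pre_test = {
    "Set A": [85, 90, 82, 60, 88],
    "Set B": [30, 45, 25, 50, 40],
    "Set C": [80, 35, 20, 81, 10],
}
flagged = high_score_sets(pre_test)
```

The sketch only identifies candidate subject areas; which of
the three options to adopt remains a judgment for the training
staff.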

While it is possible for the trainees as a group to score very
high in certain subject areas and low in others (the type of
situation described above), another type of situation is more
likely to occur. This would be one in which different
individual trainees or trainee subgroups displayed varying
degrees of competence in different subject areas. Such a
possibility might exist since (in many training situations) the
trainees have dissimilar educations, professions, or
experience. Such heterogeneity is quite likely to manifest
itself on an achievement test through variations
in scores both for the individual trainee as well as for one
or more trainee subgroups. When the instrument is administered,
as a Pre-Test, to such a heterogeneous group, the picture that
is likely to emerge is one in which a few trainees obtain
consistently high scores, some get consistently low scores and
the majority displays wide variations in the pattern of high
and low scores. Such a high-low profile of score "scatter"
can be used to determine which individuals or groups of
trainees require special instructional attention and the subject
areas in which such attention is required. This can be done by
considering the scores for each segment of instruction (i.e.,
as represented by the separate item sets) independently. On
the basis of each group of item set scores, the trainees can
be grouped (using some pre-defined set of cut-off scores) into
'high,' 'medium' or 'low' scoring categories. Special attention
could then be provided to those trainees in the 'high'
and those in the 'low' score categories. This special attention,
commensurate with competence level of each of the subgroups,
could take the form of small group and, when necessary,
individual trainee tutorials, as well as outside supplementary
readings and/or projects.

The grouping of trainees, in addition to helping adapt the
level of instruction to the needs of the trainee subgroups,
can also serve an important evaluative function by adding
another level of evaluation to the already existing stages.
Up to this point, emphasis has been upon an analysis of test
data for the total group and for the individual trainee. With
the total group now subdivided into 'high,' 'medium' and 'low'
pre-scoring categories, it will be possible to make a
quantitative assessment of the relative impact of instruction in
terms of the trainees' incoming level of competence. Since
the possibility exists that a trainee's relative increase in
competence is not only a function of the effectiveness of
training, but also a function of the trainee's initial (i.e.,
pre-course) level of competence, an analysis of scores by
trainee subgroups might prove valuable.

In order to carry out the analysis at this level it would be
necessary to rank-order the trainees on the basis of Pre-Test
scores from highest to lowest score. The trainees would then
be split into 3 groupings of equal (or near equal) size and
the groups labelled as 'high,' 'medium' and 'low' scorers.
The Test/Retest data for each subgroup would then be subjected
(separately) to the type of analysis described in Chapter VII.
The final assessment of training effectiveness would be based
on the pooled findings of the analysis of data for each subgroup
together with a series of inter-group comparisons of
trends.
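
The rank-and-split step can be sketched in Python as follows.
The function name and the scores are hypothetical. The Manual
does not say how ties or an uneven group size should be
handled, so two assumptions are made here: ties are broken by
trainee ID, and any remainder is assigned to the
higher-scoring groups first.

```python
def split_into_three(pre_scores):
    """Rank trainees by Pre-Test score and split into three groups.

    pre_scores maps trainee ID to composite Pre-Test score.
    Ties are broken by ID, and any remainder goes to the
    higher-scoring groups first (both are assumptions).
    """
    ranked = sorted(pre_scores, key=lambda tid: (-pre_scores[tid], tid))
    base, extra = divmod(len(ranked), 3)
    groups, start = {}, 0
    for i, label in enumerate(("high", "medium", "low")):
        size = base + (1 if i < extra else 0)
        groups[label] = ranked[start:start + size]
        start += size
    return groups

# Hypothetical composite Pre-Test scores for seven trainees.
scores = {"01": 48, "02": 75, "03": 62, "04": 50,
          "05": 80, "06": 59, "07": 65}
groups = split_into_three(scores)
```

Each of the three ID lists would then define one subdeck for a
separate run of the analysis.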




It should be noted at this point that subdividing the trainees
into scoring categories for purposes of individual subgroup
analysis should not be done unless the pre-score data warrant
such a breakdown. For example, if after ranking, only a few
score points separate the highest from the lowest score, then
it would not be of value to conduct the analysis of subgroups.
It is suggested that separate subgroup analyses be conducted
only when the range is wide enough (e.g., 20 or more points,
with the trainees more or less equally distributed along that
score range) to help ensure a moderate amount of score difference
between subgroups. A close grouping of Pre-Test scores
would not permit the subdivision of trainees into subgroups
that display discriminable differences in pre-instruction
levels of competence.
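
A minimal Python check of the range criterion is shown below.
The function name is hypothetical and the 20-point default is
taken from the example in the text. The sketch tests only the
width of the range; whether the trainees are "more or less
equally distributed" along it is left to inspection of the
ranked scores.

```python
def subgroup_analysis_warranted(pre_scores, min_range=20):
    """True if the highest and lowest scores differ by at least `min_range`."""
    return max(pre_scores) - min(pre_scores) >= min_range

# Hypothetical composite Pre-Test scores for two trainee groups.
wide = subgroup_analysis_warranted([48, 75, 62, 50, 80])
narrow = subgroup_analysis_warranted([60, 62, 65, 61])
```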

Another situation that could occur is one in which the trainee
group obtains an extremely low mean score (i.e., one approaching
zero) on the Pre-Test. The question that arises is whether
or not the curriculum level is unreasonably high and beyond
the capabilities of this group, given the amount of instruction
time available. Opinions in this respect would be influenced
by indications that the trainee group has (or does not have)
a basic understanding of prerequisite terminology, concepts,
history, etc. The decision would then have to be made as to
whether or not the course level of instruction is to be lowered
and, if so, to what degree. A similar question and decision
would pertain also to any trainee subgroup who fell markedly
below the Pre-Test score levels of the others.

NOTE: If, on the basis of data provided by the Pre-Test, the
curriculum and, thus, the course of instruction is revised,
it will also be necessary to make revisions in the content
and/or structure of the test instrument and the subsequent
Test/Retest analysis. The effect on the evaluation of training
developing out of these revisions will be taken up in the
discussion on the utility of the Post-Test (see Chapter VI).




CHAPTER V 
THE CURRICULUM AUDIT 



Definition and Purpose 

On pages 28-31 (and Appendix C), guidelines are given for
facilitating the construction of test items (prior to the
beginning of instruction) using only the information provided
by the curriculum plan and the list of specified learning
objectives. It was shown that the learning outcome being
sampled by a test item is defined by the specific grammatical
and semantic structure of that item. Thus it is possible to
ensure, at the time of test construction, that the items
represent a valid sampling of the types of behavioral objectives
(i.e., specific learning outcomes) that the training
program attempts to achieve. However, it is not possible to
determine during the test development stage whether or not
the item content is a true representative sample of the types
of subject material actually covered in training. Therefore
the issue of content validity requires further consideration.

In the Test/Retest design, the testing instrument must be
constructed before the sequence of instruction has actually
been carried out. Unlike the tests of learning common to the
school and college classroom where test items are usually
written after the subject material has already been covered,
the Pre-/Post-Test instrument consists of items developed
exclusively from the curriculum study plan. The validity of
this procedure rests on the assumption that what is outlined
in the study plan will actually be presented to the trainees
during the course of instruction. This assumption is warranted
with certain types of formal instruction: an example is the
basic or introductory course where there is a well-defined
core of specific material that must be covered (e.g., algebra,
descriptive statistics, English grammar, etc.). In this case
the course content generally remains the same, regardless of
the teaching approach used or the instructors involved.

Although there probably are training programs of long standing
whose curriculum plans are well-established and accurately
reflect the structure and content of the actual course of
instruction, this is not generally the case. Training programs
are directed toward preparing individuals for some specific
vocational objective. Once training begins, it is not unusual
for the instructor(s) to alter the subject content (as outlined
in the curriculum plans) to meet the needs and interests
of the trainee group and to conform to the general level of
subject competence as indicated in a Pre-Test.






Further, unlike well-established courses where both the subject
content and the criteria required for subject competence
are highly structured, the training program curriculum is
often less well-defined and usually undergoes some modification
each time the sequence of instruction is presented.
Regardless of the cause, there is a high probability of
disparity between subject topics proposed for study as described
in the curriculum plan and the actual areas covered during
instruction. This discrepancy may result in lower content
validity of test items developed prior to the training, thus
weakening the inferential powers of the test results. Therefore,
the conclusions drawn from the statistical analysis of
scores must be weighed against an estimate of the degree of
concordance between the content of instruction and the content
of the test items. This estimate is best obtained from a
detailed study of the content of instruction as it is actually
presented, through a Curriculum Audit.*

The basic requirement for such an audit is a systematic
procedure for rating the degree of coverage given during each
training session with relation to the specific subject areas
sampled by each test item. This is best accomplished by the
use of a content checklist, a listing of test items classified
by subject content. An explicit statement defining each item
content can be provided by the test constructor from his test
blueprint (see pp. 12-16). A checklist is then made by
constructing a two-way grid with the item content list on the
vertical axis and the horizontal axis labeled with the degrees
of coverage (i.e., complete/partial/not covered). An example
of this type of checklist is provided in Figure 7. Although
it is designed for a hypothetical statistics training sequence,
the basic form can easily be adapted for any type of training.
(The checklist is incomplete since the numbers and types of
items required to provide adequate coverage of such a training
sequence are too numerous to illustrate here.)

* Developed by S.M. Wishik for the evaluation of Maternal and
Child Health courses he conducted at the University of
Pittsburgh.

FIGURE 7
SAMPLE ITEM COVERAGE CHECKLIST

quant. meth. prog. stat.
10/72-2/73

                                        DEGREE OF COURSE COVERAGE
ITEM #  CONTENT AREA                    COMPLETE  PARTIAL  NOT COVERED

  1     nominal, ordinal, interval
        & ratio measurement scales
  2     grouped frequency
        distributions
  3     cumulative frequencies
        & distributions
  4     percentiles
  5     percentile ranks
  6     measures of central
        tendency: mean, median
  7     selecting the appropriate
        measure of central tendency
  8     relationship between central
        tendency measures & the
        shapes of distributions
  9     measures of variability
 10     mathematical operations
        with the variance &
        standard deviation
 11     the concept of the
        [illegible]
 12     characteristics of a
        sampling distribution
 13     sample statistics as
        estimators of population
        parameters; mean &
        standard deviation
 14     normal distributions
 15     use of the tables of
        the normal distribution
 16     statistical hypotheses
 17     the problem of error in
        hypothesis testing
 18     statistical inferences:
        selecting the appropriate
        statistics

Concurrent and Retrospective Auditing

There are essentially two ways to conduct a Curriculum
Audit; namely, concurrently or retrospectively.

1. Concurrent Auditing, as the name implies, is an assessment
of concordance conducted on a session-by-session
basis while the training is in progress. This approach
requires that an auditor sit in on every training session.
(The auditor should be someone other than the
instructor who is familiar with the nature of the subject
material.) There are two approaches to conducting
the concurrent audit:

a) Employing the Item Coverage Checklist

The auditor should check the training syllabus (i.e.,
the curriculum plan) for each session to identify
the specific items on the checklist whose subject
areas are scheduled for discussion. (The items
should be listed in the sequence in which the subject
areas they sample are to be covered during
instruction.) During the session he will record
on the checklist, next to those items, whether or
not those subjects were, in fact, presented and the
degree to which they were covered.

Complete and/or no coverage of a content area can
be designated by placing check marks in the appropriate
cells adjoining that area on the checklist.
A brief sketch of what was and was not covered
should be recorded, however, when partial coverage
of a content area is to be noted.

This procedure is conducted for each of the training
sessions. At the end of training there will be
a record of the degree of coverage of subject areas
sampled by all of the test items. (The data from
the checklist can then be reduced to a table of
summary tabulations (see example, p. 73) for use in
the interpretation of the statistical analysis of
test results.)
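
Reducing a completed checklist to summary tabulations amounts
to counting the items at each degree of coverage. The Python
sketch below shows one way to do so. It is an assumption-laden
illustration: the function name and the audit entries are
invented, and the output layout (a count and a percentage per
level) is not necessarily that of the example table on p. 73.

```python
from collections import Counter

LEVELS = ("complete", "partial", "not covered")

def tabulate_coverage(checklist):
    """Count test items at each degree of coverage.

    checklist maps item number to one of LEVELS (must be non-empty).
    Returns {level: (count, percent of all items)}.
    """
    unknown = set(checklist.values()) - set(LEVELS)
    if unknown:
        raise ValueError(f"unrecognized coverage levels: {sorted(unknown)}")
    counts = Counter(checklist.values())
    total = len(checklist)
    return {lvl: (counts[lvl], round(100 * counts[lvl] / total, 1))
            for lvl in LEVELS}

# Hypothetical audit results for a five-item checklist.
audit = {1: "complete", 2: "complete", 3: "partial",
         4: "not covered", 5: "complete"}
summary = tabulate_coverage(audit)
```

The proportion of items whose content was only partially
covered, or not covered at all, gives a rough indication of how
far the test results should be discounted.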

b) An Alternative to the Item Checklist

The training seminar is designed as a structured
learning experience for the participants. Seminars
are conducted according to predetermined curriculum
plans which structure each sequence of instruction
in terms of the specific subject content to be
covered as well as the sequence and mode of presentation
of that content.

In order to maximize the learning opportunities, it
is often desirable to encourage the two-way
(instructor <-> participant) over the one-way
(instructor -> participant) instructional model. When this
type of interactive flexibility is emphasized, it
will not always be possible to maintain high concordance
between the subject content of the seminar
and the content outlined in the curriculum plan.
When the sequence of instruction has been altered,
the subject material reflected by the test items
may not necessarily be covered in the session
planned, but at a later (and possibly unspecified)
time. When there is a strong probability that such
alterations in sequence will occur, a variation in
the Concurrent Audit, as described above, should be
implemented.

An auditor would still attend each session but thej 
Item Coverage Checklist would not be used. Instead, 
a brief "subject coverage-by-session" table would 
be constructed by the attending' auditor. That is, 
instead of employing a pre-constructed Item Check- 
list (as describi^d above) the jauditor would compile 
a content outline for each session attended. 

The auditor would list, in sequence, the major sub- 
ject areas (by topic and subtopics) together with 
a brief description of the content subsumed under 
each major heading. Then, for each major topic, 
he would describe the degree of content coverage 
in terms of substantive variables, such as amount 



types of examples presented and/or computations 
carified out, nature and number of student exercises 
provided, , etc. » 

A table, derived from a statistics seminar, which 
* displays a sample enti^ fon a training sessi9n 
covering measures ot central tendency is in Figure 8. 

" -J 

Each description of topic and coverage should be 
brief and concise so that the auditor can spend the 
major portion of each session attending to the sub- 
ject presentation, not filling ^in the tables. In 
most cases, a few key words can be used to describe 
the topic areas together with a concise paragraph 
summarizing the coverage activities. However, enough 
information should be provided so that the tables 
are self containedw(i.e. , self-explaining) summaries 
of each training s^fcibn both in terms of topics and 
instructor/participant activities. 

Upon completion of the final session of the seminar, 
these tabular summaries would be compared with the 
items comprising the test instrument. The content 
of each item would be compared with the relevant 
topic listed in the summary to assess the degree of 
concordance between subject content of the item and 




list of concepts discussed. 






FIGURE 8

SAMPLE SUBJECT COVERAGE-BY-SESSION TABLE
(Statistics training seminar - 11/16/72)

[Table with two columns, "topic by session" and "degree of
coverage"; the handwritten sample entry for Session 3 is not
legible in this copy.]







the coverage provided that subject during the session.
This adequacy-of-coverage procedure would be conducted
on an item-by-item basis. (A broad COMPLETE/
PARTIAL/NOT COVERED scale will be adequate for
classifying each item's degree of coverage.)
These data can then be reduced to a table
(see p. 73) for later use in the interpretation of
statistical results.





NOTE: Assigned outside readings providing relevant
subject material not covered during the training
sessions should also be noted during the audit.
Simply recording on the audit form used the specific
reading, with a brief description of content,
will be sufficient. Test items covering proposed
outside material can later be checked with
the audit to determine if the material was actually
assigned.



2. Retrospective Auditing: When it is not feasible to
conduct the audit during the course of training, it
can be done after the final session on a retrospective
basis. This approach is, essentially, a content analysis
of the instructor's training log describing the
proceedings of each training session. Many instructors
maintain detailed accounts of subject coverage (usually
recorded after each session) throughout the course of
instruction. If this is not the case, the instructor
should be requested to do so, if only for auditing purposes.
(The auditors can increase the effectiveness
of the audit by providing the instructor with an outline
of the types of data they will require.)

It is possible to employ several staff members to
serve as auditors since a retrospective assessment is
conducted at a single point in time. The procedures
are basically the same as those for the Concurrent
Audit except that the analysis is based upon written
accounts of each session rather than on direct observation
of instruction by the auditor. The auditors
check each item on the checklist against the content
data detailed in the log to determine the degree of
coverage given to the subject area sampled by the item.

A table summarizing the results is then constructed
(see p. 73) for use in the subsequent test analysis.

Which approach to employ will depend on conditions specific to
each training program. Where funds and available staff time
permit, a concurrent approach should be used since the auditor
can base his assessment on direct observation rather than having
to depend on the written accounts of the instructor. When
this is not feasible the Retrospective Audit can be used.
With the latter approach, the maintenance of detailed, written
accounts of each training session conducted must be made an
integral part of the instructor's training schedule.

Using the Results

Regardless of the approach selected, the resulting data will
provide information concerning the degree to which the test
measures a representative sample of the subject matter content
under consideration. While the results of the audit do
not enter into the formal statistical analysis of test data,
the findings should be considered when interpreting the data
in terms of trainee achievement and training effectiveness.
For example, if the audit shows that many of the items were
only partially covered or not covered at all, there is reason
to conclude that the test results are invalid for the subject
areas being assessed. If, on the other hand, the audit finds
a high degree of overlap between items and actual course content,
it can be concluded that the test is assessing those
subject areas it was designed to assess.

An illustration of the way in which the results of an actual
Curriculum Audit were used in the assessment of content validity
is presented below.




The specific training program involved was a 7 week Seminar-
Workshop in Family Planning/Population Program Management.
An Evaluation Inventory, composed of 85 test items, was constructed
to measure trainee competence in three major subject
areas. The items were grouped into 3 separate item sets.

The type of audit employed was retrospective.
The subject content of each of the 85 items
was checked against the detailed descriptions
of subject coverage for each daily training
session as reported in the curriculum logbook.
The auditing was conducted by staff members
who were actually involved in instruction and
were familiar with the subject material. Each
item was discussed until all the auditors were
in agreement as to its "degree of coverage"
rating. The resulting data, tabulated from
the checklist, are reproduced in the following
table:








                   DEGREE OF COVERAGE

                   COMPLETE            PARTIAL     NOT COVERED

SET 1-ITEM Nos.    1-9; 12; 14-20;     10; 22      11; 13; 21; 29
                   23-28; 30-32

SET 2-ITEM Nos.    1-28; 30-34                     29

SET 3-ITEM Nos.    1-12; 14-19         13

TOTAL - COMPLETE:    77
        PARTIAL:      3
        NOT COVERED:  5

Less than 10% (.094) of the 85 items were rated
as not or partially covered by the course instruction.
The results justified a subjective
assessment of moderate to high content validity.
It was thus concluded that the Inventory items
provided a representative sample of the relevant
subject areas covered in the Seminar-Workshop.
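
The tallies in the coverage table can be checked mechanically. The Python sketch below is illustrative only; it is not one of the Appendix E programs, and the names `expand`, `AUDIT` and `tally` are hypothetical. It assumes that the single uncovered item in Set 2 is item 29, the only number absent from that set's COMPLETE range.

```python
def expand(spec):
    """Expand a spec such as "1-9; 12; 14-20" into a list of item numbers."""
    items = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            items.extend(range(low, high + 1))
        else:
            items.append(int(part))
    return items

# Ratings transcribed from the audit table (Set 2 "not covered" item assumed to be 29).
AUDIT = {
    "Set 1": {"COMPLETE": "1-9; 12; 14-20; 23-28; 30-32",
              "PARTIAL": "10; 22",
              "NOT COVERED": "11; 13; 21; 29"},
    "Set 2": {"COMPLETE": "1-28; 30-34", "PARTIAL": "", "NOT COVERED": "29"},
    "Set 3": {"COMPLETE": "1-12; 14-19", "PARTIAL": "13", "NOT COVERED": ""},
}

def tally(audit):
    """Count items per coverage rating across all item sets."""
    totals = {"COMPLETE": 0, "PARTIAL": 0, "NOT COVERED": 0}
    for ratings in audit.values():
        for rating, spec in ratings.items():
            totals[rating] += len(expand(spec))
    return totals

totals = tally(AUDIT)
n_items = sum(totals.values())
pct_not_fully = 100 * (totals["PARTIAL"] + totals["NOT COVERED"]) / n_items
print(totals, n_items, round(pct_not_fully, 1))
```

Under these assumptions the counts reproduce the stated totals (77, 3 and 5 of 85 items), with 8 of 85 items rated as partially or not covered.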

The "not covered" items were omitted from the subsequent scoring
and the analysis of test results. This procedure was
acceptable as the number of uncovered items was small and the
total number of items sufficiently large. It is suggested
that when the audit indicates a moderate to high degree of
validity, the "not covered" items should be omitted since they
will contribute little to an assessment of what was learned in
the training program.

If a large number of items is in the "not covered" or "partially
covered" categories, it will probably be necessary to
abandon the Test/Retest design and write a new test to be
administered at the end of instruction. This test can be made
using the Curriculum Audit itself as the subject area checklist
and drawing upon it for new items. (See pp. 74-80 for a
discussion of the Post-Test, and its uses apart from those
specific to the Test/Retest assessment.)




CHAPTER VI
UTILITY OF THE POST-TEST



The Post-Test, like its Pre-Test counterpart, serves a function
beyond that of assessing the effect of instruction on the
levels of trainee competence. The Post-Test data, taken alone,
provide information that is similar to the data derived from
the type of test given in most academic situations. That is,
they will indicate an individual's level of competence in one
or a number of specified subject areas.

If a group of trainees is preparing to assume job positions
where high levels of competence in the subject matter covered
in training will contribute greatly to their "on the job"
success, then the trainees' performances on an "end of instruction"
test will provide one indication of their job
readiness.



Like most "end of instruction" testing situations, the Post-
Test results alone will indicate only how competent each of
the trainees is in each of the subject areas being assessed.
Without the baseline data derived from the Pre-Test, the
relative contributions of the sequence of instruction and
various pre-training experiences to the competence demonstrated
by the performance of the trainees on the Post-Test
cannot be assessed.

Although the use of the Post-Test alone is not recommended
when the objective is to assess the impact of instruction on
levels of competence, certain situations will arise (to be
discussed below) in which the only objective data available
are those which were derived from administration of the test
instrument at the end of instruction.

Previously (pp. 61-64), the point was made that it will sometimes
be necessary to make revisions in the proposed sequence
of instruction. Such revisions will be called for when the
results of the Pre-Test indicate that the proposed curriculum
is not appropriate for the current group of trainees.

Since changes in the course of instruction involve changes in
subject matter coverage (see pp. 61-64), the assessment of
trainee competence might be directly affected. This is due
to the relationship between the course material and the objective
test instrument. That is, there is a high probability
that since the test items were constructed from and reflect
the content of the proposed sequence of instruction (as
originally outlined in the curriculum plan), any change in
emphasis on the material actually covered will create a disparity
between the item content and the course content. The
degree of disparity will determine the extent to which the
testing instrument (administered under a Pre-/Post-Test design)
is, or is not, valid for assessing changes in trainee competence
in the subject matter of instruction.

Situations Requiring Post-Test Revision 

An illustration of the type of testing situation that can
arise which necessitates varying degrees of revision in the
curriculum will help clarify this important issue. Consider
the following premise:

On a test instrument assessing competence in 4
subject matter areas, a group of trainees obtained
mean Pre-Test scores of 94% on Item Set 1
and 15, 23, and 31% on the other three, respectively.
The assumptions (stated on p. 61) are
held to be valid in this case. Thus, there is
evidence for concluding that the high Item Set 1
mean score indicates a very high level of competence
among the total trainee group in the
subject material being assessed by that item set.

Based upon the above premise, some representative guidelines
for modifying the assessment design in light of necessary
curriculum revisions can be provided (taking into account the
general suggestions for making curriculum revisions stated in
an earlier discussion, see pp. 61-64). Since the Pre-Test
items cannot be revised, the emphasis will be upon changes in
the Post-Test items and/or revisions in the overall analysis
of the Pre-/Post-Test data.

1. The content of instruction in the subject area
assessed by Item Set 1, when possible, can be
upgraded to a more advanced, complex level of
coverage (see option 3, p. 62). If this is
done, it will be necessary to construct new
Set 1 test items to cover this more advanced
material. These items would then replace those
items that were answered correctly by most of
the trainees on the Pre-Test. These new items,
together with those original items which most
of the trainees missed on the Pre-Test, would
constitute a new Item Set 1 to be administered
on the Post-Test.



In terms of analysis, the entire set of Test/
Retest statistical procedures outlined in
Chapter VII would be applied only to the
trainee responses to Item Sets 2-4. Since
the Item Set 1 of the Pre-Test will now differ
from that of the Post-Test, no Test/Retest
analysis would be conducted for the responses
to Item Set 1 (or for the Composite scores
since the total Pre-Test and Post-Test do not
consist of completely comparable items). Instead,
the final analysis of the responses to
Item Set 1 would be conducted on the Post-Test
administration only. The analysis of Item Set
1 data would consist of an assessment of the
number of items correct out of the total number
of items, both for the individual trainees
and for the trainee group. Since there will
be no comparable Pre-Test baseline data for
Item Set 1, the analysis will focus only upon
assessing the trainees' levels of competence
with the more advanced subject material without
relating these current levels to the effects
of instruction.

Although the Pre-Test and Post-Test responses
for Set 1 are analyzed separately and will
provide little direct evidence of the impact
of instruction on competence levels, both
pieces of information can be useful to the
overall assessment in terms of trainee achievement.
That is, the Post-Test data will show
how competent the trainees are in advanced
subject matter while the Pre-Test results
(for Set 1) display their competence in
subject material at the level which administrators
initially considered appropriate
for the trainee group.

Thus, a two step analysis is being effected.
The Test/Retest analysis of responses for
Item Sets 2-4 will provide the assessment
of training effectiveness (and job readiness)
while the separate analysis of Pre-
Test and Post-Test responses for Item Set 1
will provide the assessment of trainee
achievement and job readiness with respect
to competence in a specific subject matter
area.
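
The Post-Test-only analysis described for the revised Item Set 1 amounts to a count of items correct out of the total, for each trainee and for the group. A minimal Python sketch follows, assuming 1 point per correct item; the function name `post_test_counts` and the sample responses are hypothetical and are not taken from the Manual.

```python
def post_test_counts(responses):
    """Return per-trainee (correct, total) pairs and the group proportion correct.

    `responses` maps a trainee id to a list of 1/0 item scores
    (1 point per correct item) on the revised Item Set 1.
    """
    per_trainee = {tid: (sum(items), len(items)) for tid, items in responses.items()}
    total_correct = sum(correct for correct, _ in per_trainee.values())
    total_possible = sum(total for _, total in per_trainee.values())
    return per_trainee, total_correct / total_possible

# Hypothetical Post-Test responses: three trainees, five Item Set 1 items.
sample = {
    "T01": [1, 1, 0, 1, 1],
    "T02": [1, 0, 0, 1, 0],
    "T03": [1, 1, 1, 1, 1],
}
per_trainee, group_rate = post_test_counts(sample)
print(per_trainee)
print(f"group: {group_rate:.0%} correct")
```

Because no comparable Pre-Test baseline exists for these items, the counts describe competence at the end of instruction only; they say nothing about change.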







2. It will not always be possible to elevate the
content of instruction to a more advanced, complex
level in subject areas where the Pre-Test
results indicate existing high levels of (pre-
instruction) trainee competence. This is because
the content of instruction assessed by a
specific item set might have been initially set
at a fairly comprehensive and exhaustive level,
making an upgrading of the subject material
difficult, if not impossible.

In certain situations it might be more appropriate
either to (1) drop the subject area completely from
the curriculum and concentrate the efforts of instruction
on coverage of the low score areas
(option 1, p. 62); or (2) retain the subject area
but give it less coverage than originally proposed
and place greater emphasis on the subject
areas in which there were low pre-instruction
levels of competence (option 2, p. 62).

A change in emphasis on the subject matter of
instruction may or may not necessitate changes in
the item content of the test instrument. This
will be determined by the nature of the change in
instructional content. This is illustrated below
employing the premise as stated on p. 75.

a. If the subject material tested by Item Set 1,
the high score area, is either dropped from
the curriculum or given less coverage than
originally proposed, more time can be devoted
to coverage of the subject materials tested
by Item Sets 2-4 without adding new content
to the curriculum. Without the addition of
new subject matter it would not be necessary
to construct new and additional test items
for the Post-Test Item Sets 2-4.

For purposes of analysis, the Pre-Test and Post-
Test items would be the same for Item Sets 2-4.
The items of Set 1 can either be completely
deleted from the Post-Test or reduced so that
the Post-Test Item Set 1 consists only of the
items in that set which were missed by most of
the trainees on the Pre-Test.*



*An alternative procedure is to retain all of the original
items of Set 1 and administer them in the Post-Test. This
option will be valuable when the subject matter being tested
by Set 1 is dropped completely from the curriculum. In







Since the competence of the trainees in
the subject material tested under Item
Set 1 was shown to be quite high on the
Pre-Test, these scores will indicate the
level of competence among trainees without
relating these levels to the effects of instruction.



The Pre-/Post-Test analysis would be conducted 
for each of the Item Sets 2-4 and with the 3 
sets combined into a Composite score to assess 
the impact of the course of instruction on com- 
petence. 

b. The open instruction time which results when
subject material in high score areas (i.e.,
Item Set 1) is dropped or given reduced
coverage can be employed in the coverage of
new subject material added to supplement the
original subject content in the areas being
tested by Item Sets 2-4. Then, as was done
with the content of the original curriculum,
objective items would be constructed to test
the trainees' competence with the new material.

The original items, together with the new
items, would be administered as the Post-Test.
The original items are to be analyzed separately
from the items constructed to cover
the new subject material. Therefore, the
Post-Test item booklet should be constructed
so that the original items are administered
in their original sequence and the new items
are presented at the end, listed according
to item set.

Since the new items will be administered
only at the end of instruction, there can
be no Pre-Test baseline data for these items.
Thus, the analysis of trainee performance with
the new items would be limited to a frequency
count of the total correct out of the total



this situation, the Item Set 1 would then serve as a control
to determine the changes that occur in responses to
items administered under the Test/Retest design without
the influence of a mediating instructional sequence. A
separate statistical testing of the Test/Retest data for
Item Set 1 would determine if the score change that occurred
was, or was not, significant.






number possible on the Post-Test. From this
certain conclusions can be drawn concerning
the competence of the trainees with the new
material without, however, the capacity to
relate the levels of competence to a specific
learning experience (which in this case would
be the sequence of instruction).

The item responses for each of the Item Sets
2-4 would be subjected to the standard Pre-/
Post-Test analysis, both as individual item
sets and as a combined, composite test.
The results of this analysis would indicate
the effectiveness of the training program in
raising the levels of competence among the
trainees.
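
The booklet layout described under option b (original items in their original sequence, new items at the end) makes it straightforward to score the two groups of items separately. The following Python sketch illustrates that split for one trainee and one item set; the function name `split_scores`, its parameters and the sample responses are hypothetical.

```python
def split_scores(pre, post, n_original):
    """Score original items as a Pre/Post pair and new items as a Post-only count.

    pre:  1/0 scores on the original items (Pre-Test)
    post: 1/0 scores on the Post-Test booklet, original items first,
          new items appended at the end
    """
    if len(pre) != n_original or len(post) < n_original:
        raise ValueError("booklet layout does not match n_original")
    original_post = post[:n_original]
    new_post = post[n_original:]
    return {
        "pre_correct": sum(pre),
        "post_correct_original": sum(original_post),
        "gain_original": sum(original_post) - sum(pre),
        "new_correct": sum(new_post),      # frequency count only, no baseline
        "new_possible": len(new_post),
    }

# Hypothetical trainee: six original items, three new items at the end.
result = split_scores(
    pre=[0, 1, 0, 0, 1, 0],
    post=[1, 1, 0, 1, 1, 1, 1, 0, 1],
    n_original=6,
)
print(result)
```

Only the original items yield a gain score for the Pre-/Post-Test analysis; the new items yield a count of correct responses out of the number possible.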

The importance of the Curriculum Audit has been illustrated
in a previous discussion. When revisions in the original
course of instruction are required, the Audit will prove
particularly valuable for monitoring changes in the relationship
(i.e., the discrepancy) between the course subject
material and the item content. The original Audit should be
updated with newly constructed items and a detailed assessment
conducted to determine the degree to which the items
cover the subject matter they were designed to cover. With
the deletions, additions, and reduced coverage of subject
matter that may occur, it is very important to determine the
degree to which the course content presented is actually
covered by the items comprising the Post-Test instrument so
as to maintain the integrity of the assessment study.

NOTE: The results of the administration of the Pre-Test will
not be the only factor to suggest needed change in a proposed
course of instruction. Another potential source of change can
be found in the structure of the training experience.

The training experience should not consist of a series of
instructor's lectures on fixed topics with the trainees as
passive assimilators of relevant information. The instructor
should be as responsive as possible to the specific educational
needs and demands of the trainees.

A responsive educational experience should consist of an
interactive balance among lectures, recitations, discussions
and question and answer sessions. The various contributions
of the trainees can, however, have an effect on the sequence
and content of instruction. That is, discussion and questions
on the part of the trainees may direct the course of instruction
away from the proposed subject area coverage to a more
peripheral, but relevant, topic. (Negative factors such as
trainee boredom and disinterest may also dictate a change in
subject matter.) Such flexibility in training structure can,
in most cases, enhance the effectiveness of the educational
experience.

When content changes occur they should be noted and assessed
during the Audit and, depending upon the type of revision
made (i.e., deletion, addition and/or reduced coverage of
subject matter), the test instrument and subsequent analysis
procedures should be revised according to the guidelines
provided in this chapter.




CHAPTER VII 

ANALYSIS AND INTERPRETATION OF RESPONSE DATA




The statistical analysis of testing data can be conducted
either by manual/mechanical methods, or by computer, employing
the programs provided in Appendix E. The statistical procedures
and computations suggested here are not complex and can
be carried out easily, using prestructured worksheets and a
standard desk calculator. It should be emphasized, however,
that hand tabulations and mechanical computations are subject
to numerical errors which often go undetected, and can be time
consuming, especially when the number of trainees or test items
is large.

For those with ready access to computer facilities and personnel,
a programmed analysis should be considered. Such an analysis
requires less staff time and minimizes the probability of
computational error. Regardless of the system employed, the
quantitative procedures comprising the analysis will be the
same.



Overview 






This chapter considers the statistical analysis and interpretive
assessment of data obtained from the combined pre-/post-
instruction administrations of the test instrument. There are
three levels, or stages, of this analysis, each level determined
by whether the focus is on subject area scores (item set
and composite), individual trainee scores, or individual item
responses. The discussion of the analysis process follows
this stepwise progression.

For the stages involved with assessing the significance of
score changes from Pre- to Post-Test (Stages 1 & 2), several
hypotheses are presented, derived from a set of questions
which the analysis purportedly will answer. The specific statistical
tests appropriate to hypothesis testing in the first
stages, as well as the quantitative procedures employed in later
stages of analysis, are provided here, together with some basic
analytic assumptions and underlying statistical theory. The
discussion of each stage will be supplemented with procedural
examples employing the Pre-/Post-Test data derived from the
second field application of the methodology, conducted at the National
School of Public Health in Rennes, France.* The computational
formulas, statistical tables and worksheets, and stepwise procedural
guidelines, together with sample results employing the
Rennes data, are presented in Appendix F.

The computer programs are fully documented in Appendix E, and
therefore will not be given extensive discussion in this chapter.
However, as each analysis stage is presented, references
to the appropriate programs will be made. A procedural flowchart
for a computer-run analysis is given in Figure 9. This
flowchart is provided to indicate the temporal sequence in
which each of the programs is to be run, as well as to serve
as an illustrative overview of the analytical design. Therefore,
although developed for computer purposes, the flowchart
can also serve as a general analysis plan to guide those employing
non-computer means.

Note on Computer Usage: It will be necessary to determine
if a specific computer system can accommodate the FORTRAN IV
programs as written or if changes in the programs will be
required to provide the recommended output. When considering
a computer-run analysis, it is suggested that the programming
requirements described in Appendix E be discussed with personnel
familiar with the capabilities of the computer system to
be used.



ANALYSIS OF DATA 



Application of Statistical Tests for the Significance of Pre-
to Post-Test Score Increases

General Considerations: The primary questions, the solution of
which comprises the first two stages of the analysis, involve
whether or not the Post-Test scores have increased in magnitude
over their Pre-Test counterparts (on item sets, both for
the trainees individually and as a group), and whether or not
any of the score increases are statistically significant.



* The score and item response data illustrated here were
derived from a 113 item subject competence assessment
instrument, administered under a Test/Retest design to 31
health professionals and paraprofessionals attending a
seven week training program in Health and Family Protection.
The composite test was subgrouped into three sets of 51, 45,
and 17 items. Scoring was on a 1 point per item basis.






FIGURE 9

PROGRAM FLOWCHART FOR COMPUTER ANALYSIS

[Flowchart. Parallel PRE-TEST and POST-TEST branches, each
running: examinees' item responses (coding forms) -> item
responses (in-deck) -> item set + composite raw scores, means
& std. devs. (out-list) -> item set + composite scores (in-deck).
The two branches join in: test/retest item set + composite
scores (merged in-deck) -> t-test decisions (out-list) and
chi-square decisions (out-list). The item responses (in-deck)
from both branches also feed: test/retest item response
patterns (out-list).]

COMPUTER PROGRAMS: COMSCOR, TDEKMERG, TRELVAR, CHI**2, ITEMPAT






These queries can be translated into a general hypothesis
that in turn can be partitioned into a set of sub-hypotheses
to be tested through the application of appropriate tests of
statistical significance.

In order to make the discussion of specific statistical tests
more comprehensible to those not familiar with hypothesis testing
and statistical induction, or inference, several important
concepts are introduced here. (Those with experience in the
use of tests of significance can skip directly to the
Stage 1 level of analysis, on p. 89.)

1. Statistical Inference: A set of trainee test scores (on
both Pre- and Post-Test) should be viewed as a sample
drawn from the population of all possible scores that
would be obtained if the same trainees took the same
test an infinite number of times (assuming that on each
test administration, the trainees knew only what they
knew the first time the test was taken). The mean computed
for this hypothetical infinite set of scores is
the population mean. Correspondingly, for each of the
infinite number of sample scores, of which the actual
scores found are one example, a sample mean can be computed.

While any of the sample means can have the same value as the
population mean, only occasionally will this happen.
Chance factors, such as variations in the success of random
guessing, occurring across samples will contribute to
differences in sample means. Consider now the situation
where examinees are administered the same test instrument
before and immediately after a sequence of instruction.
In all likelihood, whether the instruction had an influence
on the Post-Test performances or not, the test scores
for the two administrations would have different means.
The question to be answered is whether the differences in Test
and Retest score means are attributable to the intervening
effects of instruction or to chance fluctuations of
sample means about some common population mean.

The application of tests of statistical significance will
attempt to determine if the observed differences between
Pre-Test and Post-Test score means reflect actual differences
in the levels of subject competence underlying Pre-
Test and Post-Test performances or simply random score
fluctuations.

The specific statistical test employed takes random chance
factors into account. If the test results show that there
is a low probability of a difference as large as, or larger
than, the one observed occurring by chance, then the difference
is said to be statistically significant.




Depending on the type of research hypothesis that was
set up as an alternative to the null hypothesis, the
rejection of the null hypothesis will provide the basis
for inferences concerning the effect of instruction on
levels of subject matter competence.

2. The Null and Research Hypotheses: The null hypothesis
($H_0$) is what is tested by the application of tests of
statistical significance. In order to determine whether
the Pre- and Post-Test means are significantly different
from each other, the strategy employed is to test the
hypothesis that the means are derived from two random
samples drawn from the same population.

For the first stage of analysis (see p. 89), the null
hypothesis will state that training efforts had no effect
on increasing cognitive competence. The hypothesis will
be tested by comparing the Pre- and Post-Test score
means.

For the second stage (see p. 90), the null hypothesis
states that the individual trainee's level of competence
was not increased as a result of training. (Operationally,
the hypothesis being tested is that the distribution
of correct and incorrect responses for each trainee
on each section of the test is independent of the time
of test administration.)

The research hypothesis ($H_1$) is set up as an alternative
to the null hypothesis. It is formulated to provide for
a definite decision when the null hypothesis is rejected
and is based on some outcome predicted by a theory or by
the research interests of the investigator (see example
below).
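
The operational form of the second-stage null hypothesis (correct and incorrect responses independent of the time of administration) corresponds to a 2 x 2 table for each trainee on each item set. The Python sketch below computes a Pearson chi-square statistic for such a table. It assumes no continuity correction; the formula actually given in Appendix F may differ, and the function name and the trainee's counts are hypothetical.

```python
def chi_square_2x2(pre_correct, pre_incorrect, post_correct, post_incorrect):
    """Pearson chi-square (1 degree of freedom, no continuity correction)."""
    a, b, c, d = pre_correct, pre_incorrect, post_correct, post_incorrect
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero; the statistic is undefined")
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical trainee on a 51-item set: 20 correct before, 40 correct after.
chi2 = chi_square_2x2(20, 31, 40, 11)
CRITICAL_05 = 3.841   # chi-square critical value for 1 df at the .05 level
improved = 40 > 20    # the statistic itself carries no direction
print(round(chi2, 2), chi2 > CRITICAL_05 and improved)
```

Because the chi-square statistic is the same whether the score rose or fell, the direction of the change is checked separately before a large value is read as evidence of increased competence.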

3. Operationalized Hypotheses: The null hypothesis, as described
above, will be stated in a general form: that
instruction produces no effect on levels of subject
matter competence. In order to test the hypothesis, it
must be stated in operational form. For example, the
first null hypothesis to be posited for testing is that
the mean Pre-Test score is equal to the mean Post-Test
score for Item Set 1 (i.e., $H_0: \mu_{T1} = \mu_{R1}$). The research
hypothesis will be so stated that, if the null hypothesis
is rejected, evidence will be provided to support an outcome
predicted (or desired) by the investigator.

With the type of analysis to be described here, the direction
of the desired mean differences is posited in
advance and is important to the evaluation objectives for
which the statistical tests were initially called into
use. Assuming that all the proper controls, discussed
in Chapter II, have been employed, significant increases
in Post-Test scores are a powerful indicator that the
sequence of instruction was effective in raising the
trainees' levels of subject matter competence. Therefore,
the research hypothesis will be directional.

In the example cited above, the research hypothesis would
state that training does increase the trainees' levels
of subject matter competence. Operationally, $H_1$ would
be that the Post-Test mean is significantly higher than
that of the Pre-Test for Item Set 1 (i.e., $H_1: \mu_{R1} > \mu_{T1}$).
Since the direction of the score deviation can be hypothesized
in advance, the test employed should be one-sided.
That is, the alternative hypothesis should declare
not only that one score mean will be significantly
greater than the other, but it should also specify which
score mean it is (in this case, the mean for the Post-
Test).



The table of t-values provided ir. riqjjre T2 or .-.pj-er.dix 
coTcide^s or.l" the one- tailed test. 

4. Interpreting the Statistical Test Results. The question
to be answered is whether or not the testing outcome
justifies the rejection of the null hypothesis in favor
of the alternative hypothesis. The conclusion drawn
from the test results is referred to as a judgment of
significance. Such a judgment is based upon the probability
of obtaining, by chance, results as discrepant as
those actually observed when, in fact, the null hypothesis
is true. In the case of Test/Retest scores, the
judgment is based upon the probability of getting a mean
for individual trainee score difference as large as the
difference actually observed. If the probability is low
enough, then the null hypothesis that such large score
differences were achieved by pure chance is not accepted.

* Although the one-tailed test is appropriate for the analysis
proposed here, it should be noted that another approach is
sometimes employed. That is, the alternative hypothesis
could simply state that the training does influence levels
of subject competence without suggesting the direction of
that influence. Operationally, the H1 would state that the
Item Set 1 mean scores are not equal (i.e., $H_1: M_{T1} \neq M_{R1}$).
A statistically significant difference in the means, whether
larger or smaller, from Test to Retest would reject the null
hypothesis in favor of H1. This is an example of employing
a non-directional, two-tailed test.



It can therefore be inferred, if the alternative hypothesis
is directional in favor of Post-Score gains, that
the sequence of instruction contributed to the magnitude
of the score differences.

• 

After both the null and alternative hypotheses have been
posited, the next step is to compute the probability of
obtaining score differences as great as or greater than
those observed on the assumption that the null hypothesis is
true and that whatever differences were observed were due to
chance. For Stage 1, this probability will be derived
from the application of the t-Test to the mean Test and
Retest scores (the probability value being the value for
the computed t-statistic).

For Stage 2, the probability will be the value for the
chi-square statistic computed from the application of the
Chi Square Test to the individual trainee Test/Retest scores.

The question that now arises is "how small a probability
is necessary in order to consider the score differences
significant?" The probability of a chance occurrence
can be of any magnitude between 0 and 1 (i.e., 0 indicating
that an outcome as discrepant as that observed could
not happen if the null hypothesis were true, and 1 indicating
that an outcome as discrepant would be certain to
occur on the basis of the null hypothesis). It is therefore
necessary for the evaluators to decide upon some
particular probability value (called the level of significance)
which they consider is small enough so that they
will feel confident in inferring a highly effective training
sequence (at the level of subject matter competence)
when the computed probability is less than this selected
value. The decision as to which particular probability
value to employ will be somewhat arbitrary and can vary
from one evaluator to the next.

One evaluator may feel that if the observed Test/Retest
score differences would occur by chance with a probability
of 1 in 10, then he is justified in inferring that
the training is effective. Another might accept only a
probability as low as 1 in 20 in order that the score
differences be accepted as significant. Still another
may only accept a probability as low as 1 in 100; and
some may require a probability as low as 1 in 1000.

The position taken in this Manual is that instruction
should have a strong impact on post-competence levels, in
terms of large increases in Post-Test scores, before such
data can serve as evidence of training effectiveness.
Therefore, the minimum standards for a statistical judgment
should be as follows:

    Refer to statistical test results with a probability
    greater than .05 as not significant;
    to those with a probability equal to or less
    than .05 (i.e., 1 in 20) but greater than .01
    as significant, or as significant at the 5%
    level; and to those with a probability equal
    to or less than .01 (i.e., 1 in 100) as highly
    significant, or as significant at the 1% level.
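
The reporting standard above can be sketched in a few lines of Python. This is only an illustration of the three categories as stated; the function name and the label strings are assumptions made for this example and do not come from the programs listed in Appendix E.

```python
def significance_label(p):
    """Map a probability to the three reporting categories stated above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p <= 0.01:
        # equal to or less than .01 (1 in 100)
        return "highly significant (1% level)"
    if p <= 0.05:
        # equal to or less than .05 (1 in 20) but greater than .01
        return "significant (5% level)"
    # greater than .05
    return "not significant"


if __name__ == "__main__":
    for p in (0.20, 0.05, 0.03, 0.01, 0.001):
        print(p, "->", significance_label(p))
```

Note that the boundary values .05 and .01 fall in the more significant category, because the standard reads "equal to or less than".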



NOTE: The above discussion is only a cursory presentation of
the concepts of statistical inference as they relate to the
types of analyses described in this chapter. Although the
topic will be referred to throughout this chapter, many of the
assumptions and principles underlying the specific statistical
tests and procedures employed in the analysis are beyond the
scope of the Manual. For a more comprehensive coverage of
the area of statistical testing, the interested reader should
consult one of the statistical texts cited in the Bibliography.



STAGE 1: SUBJECT AREA SCORES

The first area of analysis is the Pre-/Post-Test performance
of the total trainee group on the test as a whole and on item
sets. The assumption underlying this stage of analysis is
that significant increases in levels of competence among
trainees as a result of training should first manifest
themselves as significant increases in Post-Test scores over
Pre-Test scores, for individual item sets and the composite
test. The logical question to pose at this stage then is
whether or not the increased competence involves all subject
matter covered during the course, or only selected areas.

The analysis question is whether or not the item set and
composite Post-Test scores show statistically significant
increases over their Pre-Test counterparts. Since in a Pre-/
Post-Test situation there may be factors other than the
effects of instruction that could cause score increases
(such as random score fluctuations), it is necessary to
apply a selected statistical test to determine the
significance of any observed score increases.

Consider, for example, the test scores illustrated in Figure
10. As can be observed, all subject area scores show positive
mean score gains. Whether these gains represent significant
score increases or are simply a function of chance factors
can be determined by the application of the appropriate test
for statistical significance.



Figure 10

MEAN PRE-/POST-TEST SCORES FOR ITEM SETS AND COMPOSITE TEST

Subject Area    Pre-Test    Post-Test    % Gain    Possible Score
Item Set 1        23.77       35.06       47.4           51
Item Set 2        23.87       30.94       29.6           45
Item Set 3         6.74        9.13       35.4           17
Composite         59.39       75.13       38.1          113
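
The "% Gain" column of Figure 10 can be approximated with the short Python sketch below. The figure does not state how the column was computed; the formula used here (the mean gain expressed as a percentage of the Pre-Test mean) is an assumption, and recomputed values may differ from the printed ones in the last digit.

```python
def percent_gain(pre_mean, post_mean):
    """Mean score gain as a percentage of the Pre-Test mean (assumed formula)."""
    if pre_mean <= 0:
        raise ValueError("the Pre-Test mean must be positive")
    return 100.0 * (post_mean - pre_mean) / pre_mean


# Item set means as printed in Figure 10: (Pre-Test mean, Post-Test mean).
ITEM_SET_MEANS = {
    "Item Set 1": (23.77, 35.06),
    "Item Set 2": (23.87, 30.94),
    "Item Set 3": (6.74, 9.13),
}

if __name__ == "__main__":
    for name, (pre, post) in ITEM_SET_MEANS.items():
        print(f"{name}: {percent_gain(pre, post):.1f}% gain")
```

A large percentage gain is descriptive only; whether the gain is statistically significant still has to be settled by the test described next.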







The statistical test selected for this level of analysis is
the t-Test for Related Variables.

(A variation of the standard t-Test is included in Appendix
F. It was designed by Dr. David Wolfers, of the Institute
staff, specifically for [illegible] in the analysis. However,
[illegible] that those with experience [illegible] of the
standard test [illegible] prove its utility and [illegible]
new techniques [illegible])

The score distributions recorded (see Figure 5) will provide
the data for this analysis. For each item set, the hypothesis
to be tested is that the Pre-Test mean is equal to the
Post-Test mean. This will be tested in each case against the
alternative that the Post-Test mean is significantly greater
than the Pre-Test mean.

The computational formulas for the statistic and the
criteria for significance, together with [illegible], are
given in Appendix F.
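
For readers without access to Appendix F, the following Python sketch shows the standard textbook t-statistic for related (paired) scores. It is not the Wolfers variation and not the Appendix F formula, neither of which is reproduced here. The five score pairs are the Item Set 1 scores of the first five trainees listed in Figure 11, used only for brevity, and the critical value 2.132 is the standard one-tailed 5% value for 4 degrees of freedom taken from ordinary statistical tables, not from Figure F2.

```python
import math
from statistics import mean, stdev


def related_t(pre_scores, post_scores):
    """Return (t, degrees_of_freedom) for paired Pre-/Post-Test scores."""
    if len(pre_scores) != len(post_scores):
        raise ValueError("each trainee needs both a Pre- and a Post-Test score")
    n = len(pre_scores)
    if n < 2:
        raise ValueError("at least two trainees are required")
    # Work on each trainee's own score difference (Post minus Pre).
    diffs = [post - pre for pre, post in zip(pre_scores, post_scores)]
    s_d = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if s_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (s_d / math.sqrt(n))
    return t, n - 1


def one_tailed_decision(t, critical_t):
    """Reject the null hypothesis only for a gain: t must exceed the positive critical value."""
    return t > critical_t


if __name__ == "__main__":
    pre = [28, 30, 13, 26, 23]   # illustrative subset, not the full group
    post = [34, 42, 19, 43, 36]
    t, df = related_t(pre, post)
    critical_t_05 = 2.132        # one-tailed, 5% level, df = 4 (standard table)
    print(f"t = {t:.3f} with {df} degrees of freedom")
    print("reject H0" if one_tailed_decision(t, critical_t_05) else "do not reject H0")
```

Because the test is one-tailed, a large negative t (a score loss) never leads to rejection of the null hypothesis in this sketch.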



When all t-testing has been completed, the decisions should
be recorded on an Analysis Summary Profile (see Figure 13),
for use in subsequent reporting of the assessment.



** PROGRAMS TDEXMERG & TRELVAR ARE USED FOR THIS STAGE OF THE
ANALYSIS (see Appendix E) **

STAGE 2: INDIVIDUAL TRAINEE SCORES

When the objective is to determine if individual trainees
have increased their levels of competence from the pre- to
the post-training period, the statistical procedure employed
is the Chi Square Test of Independence. This test is applied
to item set and composite score pairs for each trainee. The
score data can be expressed in a 2 X 2 Contingency Table. An
example of this type of table, containing the Pre- and Post-
Composite scores for trainee 01 (see Figure 11), is shown
below. (The computational formula and criteria for
significance for the Chi Square Test, using data from the
Table, are provided in Appendix F.)
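
The 2 X 2 computation can be sketched in Python as follows. Several points are assumptions: the table is laid out with Pre-Test and Post-Test as rows and correct and incorrect counts as columns; the statistic is the ordinary Pearson chi-square without a continuity correction, which may differ from the Appendix F formula; and the probability returned is the usual non-directional chi-square probability for one degree of freedom. The example uses the Item Set 1 scores of trainee 01 (28 and 34 correct out of 51 items) in place of the composite scores.

```python
import math


def chi_square_2x2(pre_correct, post_correct, n_items):
    """Pearson chi-square (1 df, no continuity correction) for one trainee.

    Rows are Pre-Test and Post-Test; columns are correct and incorrect.
    Returns (chi_square, probability).
    """
    a, b = pre_correct, n_items - pre_correct      # Pre-Test row
    c, d = post_correct, n_items - post_correct    # Post-Test row
    if min(a, b, c, d) < 0:
        raise ValueError("scores must lie between 0 and the number of items")
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a zero marginal total makes chi-square undefined")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


if __name__ == "__main__":
    chi2, p = chi_square_2x2(pre_correct=28, post_correct=34, n_items=51)
    print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

The closed form using `math.erfc` holds only for one degree of freedom, which is all a 2 X 2 table requires; larger tables would need a general chi-square distribution routine.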




[The data source for this] analysis stage is the individual
trainee's Pre-Test and Post-Test Score Summary Profile
(Figure [illegible]).

Since the Chi Square test is applied repeatedly to each set
of Pre- and Post-Test scores for all trainees, it is suggested
that each computation be checked by an independent evaluator.

When all testing is completed, the conclusions (significant
or non-significant) should be recorded with the other data
on the Analysis Summary Profile sheet (Figure 13).

** PROGRAM CHI**2 IS USED FOR THIS STAGE OF DATA ANALYSIS
(see Appendix E) **



STAGE 3: ITEM RESPONSE PATTERNS



The first two stages of analysis involve the application of
statistical procedures for determining the significance of
Test to Retest score changes by subject area and for individual
trainees. The objective of the final stage is to isolate
the factors which contributed to these changes. The data
must be analyzed item-by-item, categorizing each item according
to the type of response pattern that occurred from Test
to Retest. There are four possible patterns:



(1) Correct in both Pre-Test and Post-Test (C→C);
(2) Incorrect in both Pre-Test and Post-Test (I→I);
(3) Correct in Pre-Test and Incorrect in Post-Test (C→I);
(4) Incorrect in Pre-Test and Correct in Post-Test (I→C).
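
The four patterns above can be tallied mechanically once each item has been scored correct or incorrect on both administrations. The Python sketch below assumes one list of true/false values per administration for a single trainee, in the same item order; the function names and the short example data are hypothetical.

```python
from collections import Counter

PATTERNS = ("C->C", "I->I", "C->I", "I->C")


def response_pattern(pre_correct, post_correct):
    """Name the Test to Retest pattern for a single item."""
    return ("C" if pre_correct else "I") + "->" + ("C" if post_correct else "I")


def tally_patterns(pre_items, post_items):
    """Count the four response patterns across all items for one trainee."""
    if len(pre_items) != len(post_items):
        raise ValueError("Pre-Test and Post-Test must cover the same items")
    counts = Counter(response_pattern(a, b) for a, b in zip(pre_items, post_items))
    # Always report all four categories, including those with zero items.
    return {name: counts.get(name, 0) for name in PATTERNS}


if __name__ == "__main__":
    # Hypothetical six-item test: True means the item was answered correctly.
    pre = [True, True, False, False, True, False]
    post = [True, False, False, True, True, True]
    print(tally_patterns(pre, post))
```

The same tally can be run across trainees for a single item, which gives the by-item view of the response patterns.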



[The first part of this passage, on the I→C pattern, is only
partly legible in the source. Legible fragments: "... as a
result ... the overall magnitude of the ... C→I items. While
this type ... Test-Retest ... it cannot ... Some ... correct
responses resulted ... guessing or naive reasoning rather
than from ... the subject material."]

The C→I items may contain irregularities in structure or
content. When such items are included in a test, variations
in response to these items from one testing to the next are
highly probable. The new learning acquired during training
may produce confusion, especially on faulty items. That is,
attempts to apply new information to the questions or
problems posed by the Post-Test items may result in a greater
number of incorrect responses than would be obtained by
guessing, with little or no understanding of the subject
material. The C→I pattern can be the result of the common
problem associated with objective-type items, of "knowing
too much" to respond correctly, or of over-analysis of items
of questionable structure and intent.



In order to conduct an adequate analysis of item response
patterns, the item response data should be displayed in tabular
form for each item set and the composite test. The types of
tables required are those shown in Figures 11 and 12. The
data in these tables are tabulations of the Test-Retest
responses of the 31 trainees to the 51 items of Set 1. (While
the discussion of the analysis will center on one item subset,
the procedures involved are the same for all item sets and the
composite test.)

To construct such a table, work with the responses on the




Figure 11

ITEM RESPONSE PATTERNS BY INDIVIDUAL TRAINEE
FP/MCH Program I (11/73-12/73)

           TOTAL CORRECT   (1) PRE→POST   TOTAL INCORRECT   (2) PRE→POST   TOTAL ITEMS
TRAINEE     PRE    POST     C→C    C→I      PRE    POST      I→I    I→C       (1+2)
    1        28     34       24      4       23     17        13     10         51
    2        30     42       26      4       21      9         5     16         51
    3        13     19        6      7       38     32        25     13         51
    4        26     43       25      1       25      8         7     18         51
    5        23     36       19      4       28     15        11     17         51
    6        29     37       28      1       22     14        13      9         51
    7        33     33       28      5       18     18        13      5         51
    8        23     31       16      7       28     20        13     15         51
    9        28     32       22      6       23     19        13     10         51
   10        29     34       25      4       22     17        13      9         51
   11        25     34       21      4       26     17        13     13         51
   12        13     29       12      1       38     22        21     17         51
   13        18     35       15      3       33     16        13     20         51
   14        22     35       17      5       29     16        11     18         51
   15        24     36       20      4       27     15        11     16         51
   16        16     36       11      5       35     15        10     25         51
   17        31     41       30      1       20     10         9     11         51
   18        20     36       17      3       31     15        12     19         51
   19        19     36       16      3       32     15        12     20         51
   20        21     41       20      1       30     10         9     21         51
   21        29     37       25      4       22     14        10     12         51
   22         8     31        5      3       43     20        17     26         51
   23        27     35       22      5       24     16        11     13         51
   24        29     36       23      6       22     15         9     13         51
   25        28     38       26      2       23     13        11     12         51
   26        23     38       21      2       28     13        11     17         51
   27        27     29       15     12       24     22        10     14         51
   28        29     39       25      4       22     12         8     14         51
   29        18     38       16      2       33     13        11     22         51
   30        22     35       16      6       29     16        10     19         51
   31        26     31       23      3       25     20        17      8         51
TOTAL:      737   1087      615    122      844    494       372    472       1581
            46.6%  68.8%    83.4%  16.6%                     44.1%  55.9%

Percentages: the Pre and Post correct totals are percentages of
all 1581 item responses; C→C and C→I together make up 100% of
the pre-correct responses; I→I and I→C together make up 100% of
the pre-incorrect responses.

original Test and Retest answer sheets. Transferring the data
from the answer sheets to the final table can be facilitated
by means of an item response worksheet. A section of the
worksheet used to construct the tables in Figures 11 and 12, as
well as the steps involved in constructing these tables from the
worksheet, are fully described in Appendix F.

It should be stated at this point that although the data used
are quantitative, the analysis of item pattern responses is
essentially qualitative and subjective in nature. It results
in tentative assumptions rather than substantiated conclusions
derived from rigid statistical testing.

In order to illustrate how the analysis should be conducted,
an actual situation is described. The following is a summary
of the analysis conducted on the data in Figure 11. In this
situation, application of tests of statistical significance
found that the Post-Test scores displayed significant gains
over their Pre-Test counterparts. An item response pattern
analysis was conducted in an attempt to isolate the factors
which may have contributed to this significance.

The percentage of total test items having correct Pre-
Test responses is 46.6%, with 53.4% being incorrect
Pre-Test items. Pre- to Post-Test response patterns
were analyzed by comparing changes occurring among
pre-correct items with those pre-incorrect.

83.4% of the pre-correct items were also correct on
the Post-Test, compared with 44.1% of the pre-incorrect
items that were also incorrect on the Post-Test. This
shows a greater degree of response stability for the
pre-correct than for the pre-incorrect from Test to
Retest. That is, the number of pre-incorrect items
that change from Test to Retest is greater than those
initially correct.

The percentage of total pre-incorrect responses that
became correct on the Post-Test is 55.9%, which represents
the total amount of Pre- to Post-Test score
increase. This increase is reduced, however, by a
downward shift of 16.6% of the total pre-correct to
post-incorrect responses.*



* The influence of these opposing patterns on relative Test/
Retest score change can best be illustrated by considering
an individual case (see Figure 11). For example, on Item
Set 1 trainee No. 1 displays Pre- and Post-Test scores of
28 and 34, respectively. The score gain of 6 is the result
of two opposing patterns of Test to Retest item response
shift. (The number of I→C items is 10 and the number of
C→I items is 4; from this the net score increase is
calculated as 6, i.e., 10-4.)
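
The percentages quoted in this analysis, and the net gain in the footnote, follow directly from the four pattern counts. The Python sketch below recomputes them from the group totals of Figure 11 (C→C 615, C→I 122, I→I 372, I→C 472) and from the counts for trainee No. 1; the function and key names are assumptions made for this illustration.

```python
def pattern_summary(cc, ci, ii, ic):
    """Summarize Test to Retest response patterns from the four counts."""
    pre_correct = cc + ci        # items correct on the Pre-Test
    pre_incorrect = ii + ic      # items incorrect on the Pre-Test
    total = pre_correct + pre_incorrect
    if pre_correct == 0 or pre_incorrect == 0:
        raise ValueError("both pre-correct and pre-incorrect items are needed")
    return {
        "pre_correct_pct": 100.0 * pre_correct / total,
        "post_correct_pct": 100.0 * (cc + ic) / total,
        "c_to_c_pct": 100.0 * cc / pre_correct,      # stable correct
        "c_to_i_pct": 100.0 * ci / pre_correct,      # downward shift
        "i_to_i_pct": 100.0 * ii / pre_incorrect,    # stable incorrect
        "i_to_c_pct": 100.0 * ic / pre_incorrect,    # upward shift
        "net_gain": ic - ci,                         # Post score minus Pre score
    }


if __name__ == "__main__":
    group = pattern_summary(cc=615, ci=122, ii=372, ic=472)
    for key, value in group.items():
        print(f"{key}: {value:.1f}")
    trainee_1 = pattern_summary(cc=24, ci=4, ii=13, ic=10)
    print("trainee No. 1 net gain:", trainee_1["net_gain"])
```

The net gain is the upward shift minus the downward shift, so a large C→I count can mask real learning in the I→C category.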






This C→I shift is probably to a large extent some
type of testing artifact which had a negative effect on
overall trainee performance on Item Set 1. While the
available data do not permit a determination of the
cause(s) of this artifact, the fact remains that it
does reduce the overall score gain in Item Set 1 to
a certain degree. The net increase in Post-Test
scores was of sufficient magnitude, however, to prove
significant when subjected to statistical testing.

Such an analysis should be conducted for each item set and the
composite test. In terms of the data provided, the breakdown
of Pre- to Post-Test response behavior into discrete response
pattern categories will be relevant to the assessment-of-
training study being proposed. The analysis of item patterns
provides not only a quantitative indicator of the amount of
learning that has occurred from the pre- to post-instruction
period (i.e., the magnitude of the I→C shift), but also data
concerning those factors which contribute, both positively and
negatively, to the net score gains, and that portion of
response behavior where no Test to Retest change occurred.

An item set and composite response pattern analysis is even
more relevant when the Post-Test scores fail to display
significant increases over their Pre-Test counterparts. In this
situation, the analysis would attempt to isolate those item
response factors which contributed to the lack of significant
score increase. The assumption underlying the analysis in
this case is that non-significance of score gain may be the
result of negative factors related to test performance and not
only of the failure of the trainees to increase their levels
of subject competence between the pre- and post-instruction
periods.

In addition to the distribution of response patterns for total
item groups, the data in that same table (in Figure 11) provide
item response patterns for each individual trainee. An analysis
of the relative contribution of each response pattern to the
trainee's post-scores will, like the analysis of item groups,
help isolate those test factors that contributed to the
trainee's achievement of (or failure to achieve) significant
score gains.

Test/Retest item response patterns can also be assessed by
individual item. A table of response patterns generated for
each of the 51 items of Set 1 is illustrated in Figure 12.
For each item, comparisons of the frequencies in each response
pattern category will indicate the relative contribution of
that item to the Pre- to Post-Test score change. Again, since
the focus is on score increases, the item patterns involving
change (I→C and C→I) would be examined. Those items
contributing most to the Post-Test score increase will be those
with the highest frequency of response change in the I→C
category. For example, Item No. 27 (with C→I and I→C
frequencies of [illegible] and 2[illegible], respectively) makes
a greater relative contribution to the Test/Retest score
increase than Item No. 9 (with a C→I frequency of 1[illegible]
and an I→C of [illegible]). A comparison of each item's
response patterns with those of every other item in the
table would be time consuming and unnecessary, since the most
relevant item response pattern assessments are those at the
item group (item sets and composite test) and individual
trainee levels. However, the evaluator should examine the data
generated for the item tables (i.e., the table in Figure 12)
to obtain a more complete picture of Test/Retest response
behavior down to and including the individual item level.

[Figure 12. Response patterns for each of the 51 items of
Set 1. The table is not recoverable from the source.]




[Beginning of sentence illegible] to compare the Curriculum
Audit (see Chapter 7) with the data in the individual item
tables. The judgments of degree of course coverage given the
content assessed by individual items can be compared with the
item pattern response frequencies to determine if variations
in item content coverage are reflected in variations in the
response pattern frequencies. For example, one hypothesis that
might be evaluated is that the more completely a test item's
content is judged to have been covered during instruction, the
higher will be the frequency in the item's I→C category and
the lower the frequency in the C→I category.

Knowing the response patterns of individual items is also
helpful in determining the quality of items. Items which have
a high frequency of C→I or I→I responses should be carefully
reviewed. Having this information available for individual
items greatly assists the item analysis discussed on pages 121-



The validity of the item pattern analysis depends upon how well
the items themselves were constructed. If it can be assumed
that the items are adequately designed and actually measure
what they were designed to measure, then such an analysis,
combined with the statistical test results of Stages 1 and 2, will
provide most of the data which will contribute to decisions
concerning training effectiveness and trainee achievement at
the level of subject matter competence.



** PROGRAM [name illegible] [illegible] THE ITEM PATTERN TABLES
OF THE TYPE ILLUSTRATED IN FIGURES 11 AND 12 (see Appendix E) **







USING THE ANALYSIS RESULTS: EVALUATING TRAINING 

The final results of the statistical analysis of score data
(Stages 1 and 2) should be recorded on an analysis summary
table of the type shown in Figure 13*. In this the staff
will have (all on one form) the score distributions with their
respective means and standard deviations, together with a
summary of results from the application of tests of significance
to the trainee group as well as individual trainee Test/Retest
scores. The score data are grouped by separate item subsets
(e.g., Item Sets 1-5) as well as by total (composite) test.
These data, when combined with the results of the response
pattern analysis and Curriculum Audit, provide a comprehensive
profile of a trainee group's performance in a pre-/post-
instruction testing situation. (NOTE: When data have been
analyzed by trainee subgroup, the score data for each grouping
should be presented separately, to facilitate inter-group
comparisons, either on the same or, when the number of
trainees is large, on separate summary sheets.)

Valid inferences concerning training effectiveness and trainee
achievement in the area of subject matter competence can be
drawn from this body of data if the test instrument which
generated these data was constructed according to the
structured Guidelines provided. In addition, certain technical
factors relating to statistical testing and achievement
criteria, when taken into consideration, increase the confidence
with which such inferences can be drawn.

Statistical Test Results. When interpreting the results of
repeated application of tests of significance to the item
group and individual trainee score gains, only a large number
of "significant" differences should be accepted as providing
evidence for the effectiveness of training. This is due to
the fact that when a large number of statistical tests is
applied to a body of score data, a small number of significant
results can be expected to occur by chance alone.** One or a
few marginally "significant" results (from a large number of
test applications) can, therefore, be misleading, but the
consistency of a large number of "significant" differences can
serve as a valid indication of the positive impact of
instruction on levels of subject competence. Therefore, even when



* "ns" indicates non-significant score gains, while p<.05 and
p<.01 indicate score increases found to be statistically
significant at or beyond the 5% and 1% levels, respectively.

** Both Kish (10) and Selvin (11) discuss this factor in their
assessment of the application and misuse of tests of
significance in research; computational formulas are provided for
estimating the number of significant results to be expected
by chance when X number of statistical tests are applied.



[Figure 13. Analysis Summary Profile: Pre-/Post-Test scores by
trainee for each item set and the composite test, with
significance decisions (ns, p<.05, p<.01). The table is not
recoverable from the source.]



the item set and composite scores are found to be significant,
the total number of "significant" Test/Retest score differences
within item groups must be considered and weighed as evidence.
For example, in Figure 13, the large number of significant
decisions among trainees for Item Set 1 (i.e., 16 out of 31)
and the composite test (23 out of 31) provides a strong
indication of the positive impact of instruction on competence,
both with the subject matter sampled by Set 1 and with the
overall subject matter. However, only 6 of the 31 trainees
displayed significant score gains in Set 2 and none was
displayed in Set 3. Given the high significance of the mean score
gains in these two areas (i.e., p<.01), the small number of
within-set "significant" decisions indicates that the training
was less effective in raising levels of competence in these two
areas than with the Set 1 material and the overall material.
Reporting (in the final write-up) both the overall results of
testing for the item sets and composite together with number of
"significant" results out of total tests applied within item
groups will allow a more adequate and reliable assessment of
training effectiveness in general and will allow the evaluators
to assess the relative effectiveness of training by subject
areas and by individual trainees within subject areas.
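
The caution about chance results can be given a rough numerical form. The Python sketch below is not the computational formula of Kish or Selvin, which is not reproduced in this Manual; it uses the elementary binomial expectation and assumes that the tests are independent and that every null hypothesis is true. Under those assumptions, the expected number of chance "significant" results is the number of tests multiplied by the significance level.

```python
def chance_significant(n_tests, alpha):
    """Expected count of chance 'significant' results and the chance of at least one.

    Assumes independent tests with every null hypothesis true.
    """
    if n_tests < 0:
        raise ValueError("the number of tests cannot be negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("the significance level must lie between 0 and 1")
    expected = n_tests * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return expected, p_at_least_one


if __name__ == "__main__":
    # One test per trainee on a single item set, 31 trainees, 5% level.
    expected, p_any = chance_significant(n_tests=31, alpha=0.05)
    print(f"expected by chance: {expected:.2f} of 31")
    print(f"probability of at least one: {p_any:.2f}")
```

Read against this baseline, counts such as 16 or 23 significant decisions out of 31 lie far above what chance would be expected to produce, while a count of one or two would not.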

One final consideration regarding statistical test results:
although both the 5% and 1% levels are suggested here as
acceptable significance levels when assessing score gain, the
training staff might want to consider only the more substantial
score gains as providing evidence of training effectiveness. In
such cases, the staff might only accept as evidence those score
gains significant at the 1% level or higher (e.g., the .005 or
.001 levels). The higher the level of significance for a
particular Test/Retest score gain, the more confident will be the
inference drawn concerning the impact of instruction in
affecting competence levels. (The reader can set significance
levels different from those suggested here and obtain the
sampling distributions for these levels, like those provided in
Figure F2 for the 5% and 1% levels, from any standard statistics
text.)

Criteria of Achievement. Another technical factor that must be
considered when interpreting the testing results is
establishing criteria of achievement, i.e., defining standards
of what constitutes high competence in a particular subject
matter area.

It will not usually be possible for a training staff to set
meaningful, pre-determined score values which the trainees must
attain or exceed in order to be considered highly competent in
one or another subject area. The reason for this lies with the
nature of the subject matter covered during instruction.
According to Ebel (12),

"Course content is usually selected on the basis of subjective
decisions, often by individual instructors. As such,
it hardly possesses the characteristics of an absolute
standard of achievement. Nor is it ordinarily possible for the
constructor of an objective test to gauge the difficulty of
his items precisely enough to define a fixed standard of
achievement with respect to that content."

If it is not possible to set meaningful cut-off scores for
determining high competence in the post-instruction period,
another indicator, or set of indicators, of training
effectiveness must be employed.

The first indicator to consider is the results of the
application of tests of statistical significance. However, even
when there are highly significant Pre- to Post-Test score
increases, these may or may not be acceptable to the staff as an
indication of training effectiveness. Consider, for example, a
test with a maximum score of 100. A mean composite score increase
(all trainees) from a Pre-Test of 20 to 45 on the Post-Test is
found to be statistically significant. Given the significant
finding, the training staff is faced with the task of assessing
how effective the training sequence was (and how competent the
trainees are) when an average of 55% (i.e., 100-45) of the
subject material sampled by the test items was not learned.
In this situation, the significance of the score gain is not
sufficient evidence for any definitive evaluative decision.
This, combined with the usual absence of established criteria
against which trainees' scores can be compared, makes it
necessary to carry out a further assessment of testing results.

Level and Magnitude of Score Movement 

Along with the magnitude of the Pre- to Post-Test score
movement, the level at which that movement occurs must also be
considered when drawing inferences concerning training
effectiveness. That is, trainees may be distinguished according to
subject competence, reflected by the magnitude of the Post-Test
scores, and/or their achievement, which reflects their performance
on the Post-Test compared with their performance on the
Pre-Test. Training administrators and the trainee supervisors
(in the post-training job situation) will be interested in both
these parameters.

The amount of competence a trainee displays will be of obvious
interest to those who are responsible for placing him in a job
or assigning him job duties. The training administrators will
also be interested in seeing that the trainees demonstrate high
levels of competence, as indicative of the effectiveness of
their training efforts.






The amount of achievement displayed, however, is also an
important piece of information for all concerned. To take an obvious
case: if two trainees complete a training sequence with
Post-Test scores of 90%, it can be assumed that they are, more or
less, equally competent with the subject matter being assessed.
However, if one of these trainees started out with a Pre-Test
score of 40% and the other with a Pre-Test score of 85%, the
large amount of achievement demonstrated by the first trainee
may be said to show: a) that the sequence of instruction was
highly effective in increasing his subject matter competence
and b) that he has a great capacity for self-improvement,
which could be an important quality to take into account when
deciding what job to assign him.

In addition to the amount of achievement (i.e., the magnitude
of the Pre- to Post-Test score movement), the level at which
the score movement takes place within the possible range of
score change will be of importance. In order to consider both
parameters of achievement, it is recommended that an additional
score be computed for all groups, subgroups and individual
elements; that is, for the entire trainee group, for the total
item group, for trainee and item subgroups and for individual
trainees and items. Such a score would consist of the sum of
a competence score (i.e., the Post-Test score) and an
achievement score (i.e., the Post-Test score minus the Pre-Test score).
Giving both scores equal weight in computing the new combined
score allows for a trainee who has made a great score gain to
show well in the overall picture, even if his final competence
score may be somewhat lower than that of another trainee.
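
The combined score just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic (competence plus achievement, equally weighted); the function name `combined_score` and the variable names are assumptions introduced here and do not appear in the Manual.

```python
def combined_score(pre: float, post: float) -> float:
    """Return the unweighted A+C score for one Test/Retest pair."""
    competence = post          # C: the Post-Test score
    achievement = post - pre   # A: Post-Test score minus Pre-Test score
    return competence + achievement


# The two trainees from the example above: both finish at 90%.
first_trainee = combined_score(pre=40, post=90)
second_trainee = combined_score(pre=85, post=90)
print(first_trainee, second_trainee)
```

Under these assumptions the first trainee receives a combined score of 140 and the second a score of 95, so the trainee with the larger gain shows well even though both end at the same level of competence.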

A sample set of combined scores of the type described above is
displayed in Figure 14. In the sample, the combined
Achievement/Competence scores (A+C) are grouped from highest (190) to
lowest (30) by intervals of 10 score units. It should be noted
that the listing is not exhaustive of all possible combinations
of competence scores and achievement scores, but only a small
sample of possible combinations that serves to illustrate the
necessity of considering not only the magnitude of the score
increase but also the level at which the increase occurred.

As can be seen, the highest possible A+C score is 190, based on
a Test score of 10 and a perfect Retest score of 100, while the
lowest score possible is based on Test and Retest scores of
10 and 20, respectively.*

* Actually the highest A+C score possible is 200, based on a
Test score of 0 and Retest score of 100 (where C = 100 and
A = 100), and the lowest possible is 0, where the Test and
Retest scores are both 0. For purposes of illustration,
however, not all possible A+C scores are necessary.





FIGURE 14

ACHIEVEMENT/COMPETENCE SCORES (UNWEIGHTED)

Highest to lowest possible, intervals of 10 points

Pre -> Post = A+C

 10 -> 100 = 190

 20 -> 100 = 180

 10 ->  90 = 170
 30 -> 100 = 170

 20 ->  90 = 160
 40 -> 100 = 160

 10 ->  80 = 150
 30 ->  90 = 150
 50 -> 100 = 150

 20 ->  80 = 140
 40 ->  90 = 140
 60 -> 100 = 140

 10 ->  70 = 130
 50 ->  90 = 130
 70 -> 100 = 130

 20 ->  70 = 120
 40 ->  80 = 120
 60 ->  90 = 120
 80 -> 100 = 120

 10 ->  60 = 110
 30 ->  70 = 110
 50 ->  80 = 110
 70 ->  90 = 110
 90 -> 100 = 110

 20 ->  60 = 100
 40 ->  70 = 100
 60 ->  80 = 100
 80 ->  90 = 100

 10 ->  50 =  90
 30 ->  60 =  90
 50 ->  70 =  90
 70 ->  80 =  90

 20 ->  50 =  80
 40 ->  60 =  80
 60 ->  70 =  80

 10 ->  40 =  70
 50 ->  60 =  70

 20 ->  40 =  60
 40 ->  50 =  60

 10 ->  30 =  50
 30 ->  40 =  50

 20 ->  30 =  40

 10 ->  20 =  30



What does the A+C score tell the evaluator? It is quite clear
that upper scores (140-190) reflect both a high degree of
competence and a high degree of achievement. One could
infer from these data a highly effective training program.
That is, the trainees started off with very low Pre-Test scores
and displayed large score gains (i.e., high achievement) with
a resulting high Post-Test score (i.e., high competence).
Similarly, low A+C scores (i.e., 30-60) indicate a situation where
trainees began with low Pre-Test scores, made low to moderate
score gains, and completed a sequence of instruction with
relatively low levels of competence in the subject area under
assessment. The need to revise the curriculum would be indicated on
the basis of these low A+C scores.

Consider, however, the combinations of Test and Retest
scores which provide an A+C score of 110. (Refer to the data
in Figure 14.) As can be seen, both a low Pre-Test score of
10 and a high Pre-Test score of 70 can result in an A+C score
of 110. Here, a low Pre-Test level of 10 combined with a
moderately large score gain of 50, and a high Pre-Test level
of 70 coupled with a relatively low gain of 20, result in the
same A+C score. It is clear from the data that the first score
situation (i.e., 10 -> 60) displays a greater magnitude of
achievement than the second situation (i.e., 70 -> 90). However,
the 70 -> 90 score change reflects a higher level of post-training
competence than does the 10 -> 60 change. What inferences can
be drawn concerning the impact of the training experience based
upon these two testing situations? It would seem that one
cannot readily draw any definite conclusions.
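
The ambiguity can be made concrete with a small sketch. It assumes the unweighted score is computed as Post-Test plus (Post-Test minus Pre-Test); the helper name `a_plus_c` and the variable names are illustrative only. The two score situations tie on A+C, yet each comes out ahead under a different criterion.

```python
def a_plus_c(pre, post):
    """Unweighted Achievement/Competence score for one score situation."""
    return post + (post - pre)


# The two score situations discussed above, as (Pre-Test, Post-Test).
situations = [(10, 60), (70, 90)]

# Both situations yield the same combined score.
combined = [a_plus_c(pre, post) for pre, post in situations]

# Ranking by competence alone (the Post-Test score) favors one situation...
best_by_competence = max(situations, key=lambda s: s[1])

# ...while ranking by achievement alone (the score gain) favors the other.
best_by_achievement = max(situations, key=lambda s: s[1] - s[0])

print(combined, best_by_competence, best_by_achievement)
```

Which ranking matters depends on the effectiveness criterion the evaluator selects; the combined score by itself does not settle the question.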

Since competence and achievement are given equal weight, it
cannot be readily determined which of the two score situations
represents the more effective training experience. Such
conclusions will be based upon the criterion for effectiveness
selected by the evaluator(s). That is, if the highest level
of competence attained is the effectiveness criterion, then the
second testing situation (i.e., 70 -> 90) will reflect a more
effective training sequence than the 10 -> 60 situation.

However, if the effectiveness criterion is the magnitude of
achievement, then a different picture emerges. In this case,
the first score situation (10 -> 60) displays a greater degree
of achievement than the 70 -> 90 situation. Based upon the
achievement criterion, then, the first score situation reflects
a more effective training sequence than the second.

Since competence and achievement are afforded equal weight in
computing the A+C scores, the evaluator might decide to consider
any combination of achievement and competence scores resulting
in similar A+C scores as reflecting equal levels of
effectiveness (or ineffectiveness, as the case may be). Based upon this
reasoning, the score situations 10 -> 60, 30 -> 70, 50 -> 80,
70 -> 90 and 90 -> 100, since they all have an A+C score of
110, would be considered by the evaluator as equal levels
of training effectiveness.

The use of unweighted A+C scores of the type described above
is based upon the assumption that equal gains of raw-score
points (anywhere along the entire range of potential scores)
represent equal increments in competence or achievement.
This means, for example, that it is as difficult to increase
a Pre-Test score from 20 to 40 on the Post-Test as it is to
increase a score from 60 to 80, both of which involve equal
increments of 20 raw-score points.

If evaluators are willing to accept this assumption, then
unweighted A+C scores can be used to draw inferences
concerning training effectiveness and trainee achievement.

However, the strong possibility exists that the difficulty
in increasing a score a certain number of raw points varies
at different levels along the range of possible scores.
That is, equal gains of score points may not correspond to
equal increments in competence (or achievement) at all points
along the score range. Some evidence to support this
contention is provided by Diederich (13) in an article
describing the results of Pre- and Post-Tests given to 1,400 college
students. In a discussion of the difficulty in translating
gains of raw-score points into increments of ability (or
achievement), he states that "it is harder to get from the
mean (score) up to plus one standard deviation than it is
to get from minus one standard deviation up to the mean."
(Of course, this finding cannot be generalized to all test
situations, with a high degree of confidence, without
further study.)

If the evaluators accept this assumption as true in their
particular testing situation, then it is necessary, when
deriving the A+C scores, to use weighted values. That is,
the scores should be weighted to reflect the assumption
that as the Pre-Test score level increases, it becomes
increasingly difficult to increase that score (on the
Post-Test) by X number of raw-score points. For example, the
weighted A+C score will show that it is more difficult to
increase a pre-score of 50 to a post-score of 70 than it
is to raise a pre-score of 30 to a post-score of 50.

The relatively higher level of achievement attained in increasing
a score from, say, 50 to 70 compared to that of increasing a
score from 30 to 50 will also be reflected by the weighted A+C
scores.

The weighted A+C score curves of Figure 15 and the table in
Figure 15A displaying both Test and Retest scores and the
derived weighted A+C values will serve to illustrate the use
of the weighted score in the assessment of the effectiveness
of a sequence of training. (The mathematical equation and the
parameter values employed in generating the series of curves
from which the A+C score values are derived are provided in
APPENDIX G.*)

The A+C curves and the Test, Retest and weighted A+C score
values are, like the unweighted score data in Figure 14, for
illustrative purposes and represent only a small sample of
possible score curves and combinations of score values that
would result from a large number of Test/Retest administrations
of the same instrument.

Before discussing how the weighted scores might be employed in
the assessment process, the procedure for deriving the weighted
values for any series of Test and Retest scores will be
considered, using the data in Figures 15 and 15A.

The weighted A+C score values for a specific set of Pre- and
Post-Test scores can be obtained from the score curves in the
following manner:

Consider the case in which the Pre-Test score (Test)
is 10 and the Post-Test score (Retest) is 20. Locate
the Test score value along the vertical axis and the
Retest score along the horizontal axis. The point on
the graph at which these two values intersect (the
coordinates labelled (10, 20) in Figure 15) defines
the A+C score value. In this case the A+C score
associated with those coordinates is 10, since the A+C curve
that passes through coordinates (10, 20) is 10 (the
circled numbers next to each of the curves are the
weighted score values).

In cases where the A+C curve does not pass directly
through the intersection of a pair of Test/Retest
coordinates, the A+C value for those coordinates can
be obtained by interpolating between two adjacent score



* The mathematical equation and the resulting family of
weighted A+C score curves were developed by Dr. David
Wo-rfers of the Institute staff specifically for use in
this Manual.






FIGURE 15 




RETEST SCORE (R) 



FIGURE 15(A)

WEIGHTED ACHIEVEMENT/COMPETENCE SCORES*

Grouped according to amount of achievement (10 point intervals)

Pre -> Post = weighted A+C Score

    10 Points           20 Points           30 Points

 10 ->  20 = 10      10 ->  30 = 20      10 ->  40 = 31
 20 ->  30 = 1?      20 ->  40 = 20      20 ->  50 = 32
 30 ->  40 = 12      30 ->  50 = 22      30 ->  60 = 33
 40 ->  50 = 12      40 ->  60 = 25      40 ->  70 = 36
 50 ->  60 = 13      50 ->  70 = 28      50 ->  80 = 40
 60 ->  70 = 14      60 ->  80 = 32      60 ->  90 = 56
 70 ->  80 = 18      70 ->  90 = 49      70 -> 100 = 59
 80 ->  90 = 30      80 -> 100 = 55
 90 -> 100 = 50

    40 Points           50 Points           60 Points

 10 ->  50 = 41      10 ->  60 = 50      10 ->  70 = 61
 20 ->  60 = 42      20 ->  70 = 53      20 ->  80 = 62
 30 ->  70 = 44      30 ->  80 = 56      30 ->  90 = 66
 40 ->  80 = 47      40 ->  90 = 66      40 -> 100 = 75
 50 ->  90 = 54      50 -> 100 = 69
 60 -> 100 = 65

    70 Points           80 Points           90 Points

 10 ->  80 = 71      10 ->  90 = 81      10 -> 100 = 90+
 20 ->  90 = 74      20 -> 100 = 89
 30 -> 100 = 79

* See Appendix G for derivation formula.
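
As a rough illustration of how the tabled values behave, the sketch below transcribes only the "20 Points" column of Figure 15A into a lookup table. It does not reproduce the Appendix G equation; the dictionary name `WEIGHTED_20_POINT_GAIN` and the helper name are assumptions made for this example.

```python
# Weighted A+C values for a 20-point raw-score gain, keyed by Pre-Test score.
# Values are transcribed from the "20 Points" column of Figure 15A.
WEIGHTED_20_POINT_GAIN = {
    10: 20, 20: 20, 30: 22, 40: 25,
    50: 28, 60: 32, 70: 49, 80: 55,
}


def weighted_score_for_20_point_gain(pre):
    """Look up the tabled weighted A+C score for a Pre-Test score (20-point gain)."""
    return WEIGHTED_20_POINT_GAIN[pre]


# The same raw gain earns a higher weighted score at higher Pre-Test levels.
values = [WEIGHTED_20_POINT_GAIN[pre] for pre in sorted(WEIGHTED_20_POINT_GAIN)]
never_decreases = all(low <= high for low, high in zip(values, values[1:]))

print(weighted_score_for_20_point_gain(30), weighted_score_for_20_point_gain(50))
print(never_decreases)
```

In this column, raising a score from 50 to 70 is credited with a weighted value of 28, against 22 for raising a score from 30 to 50, even though both gains are 20 raw-score points.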






compared to a lower A+C value. The relatively higher level of
difficulty can be the basis for inferences of higher levels of
effectiveness. Such inferences will hold whether the
Test/Retest comparisons are being made between:

1) item sets

2) individual trainees

3) trainee subgroups

4) different training sequences (provided that the same
test instrument is administered to both groups of
trainees).

It should be clear from the above discussion that it is
necessary to consider not only the statistical significance of the
Test/Retest score gains that occur but also the levels at which
those gains occur when drawing inferences concerning the impact
of the training experience on levels of subject matter
competence.



Whether weighted or unweighted scores are more valid
indicators of levels of achievement (when raising scores from
various Pre-Test levels) cannot, at this time, be determined
for each situation in which a test instrument is administered
under a Test/Retest design. At the risk of complicating matters
further, it needs to be stated that very little is known
concerning comparative difficulty at different levels along a
score range. One can hypothesize a difficulty curve in the
form of a parabola reflecting the extreme difficulty in raising
a score at both the lower and upper ends of the score range,
with the mid-range being the area where score increases are
easiest to effect. This question is being studied by analysis
of a series of test data. For example, it is interesting to
note that in Figure 13, trainees whose Pre-Test scores were
55 or higher gained an average of only 18.6 raw percentage
points, compared with an average of about 26 among those with
lower Pre-Test scores. However, any definitive conclusions
concerning score level and difficulty will have to await more
extensive research in this area. Thus, the discussion of
weighted and unweighted scores, even though there is evidence
cited for the validity of the former, was presented to provide
possible approaches to the assessment of level and magnitude
of score change. While the approaches employed may differ, the
important point is that the interaction of the level of score
change and the magnitude of that change be considered an
integral part of the analysis of training effectiveness. Collective
experience in training evaluation of the type described in this
Manual may well show that the use of the Achievement/Competence
score (especially when weighted) is an important refinement of
testing methodology.






In terms of standards for trainee post-instruction competence,
however, the basic criterion will not be (as with the case of
assessing training impact) their Test/Retest score changes.
The ultimate indicator of levels of competence will be the
Post-Test scores. A trainee who begins with a Pre-Test score
of 10% and obtains a Post-Test score of 75% may have learned
more (attained a higher level of achievement) and have
demonstrated higher motivation and other desirable qualities than
one who shows an increase from 75 to 90%. But the 90% score
must still stand as an indicator of a higher level of
post-instruction competence. From the point of view of the
supervisor who will assign the trainee to a specific job in the
post-training period, the higher level of competence attained
by the second trainee may be more important than the amount
of achievement displayed by the first trainee. However, the
selection of one trainee over another is a job-specific
decision that would have to be made by supervisors considering
both the requirements of the job and the qualifications of the
trainees being considered. It is, therefore, important that
the maximum amount of assessment data for each individual be
made available in the post-instruction period.






Summary 

The tables provided in Figures 16 and 16A are designed to
provide summary information about the test instrument, the
training, and the trainees. They illustrate what the test
instrument will show when the analysis has been completed.



1. Part I (Table 16) The figures entered in column A
(Pre-Test Level) are the total percent of items answered
correctly on the total test and on subsets of items; and for
individual items, the percent of times they were answered
correctly. (For example, if a test of 100 items was taken
by 50 trainees, there is a potential of 5,000 correct answers.
The percentage of those 5,000 which were actually correct on
the Pre-Test is the number that is entered at the top of this
column in the row labeled "Total Subject Matter".) Comparable
entries apply to the Subset rows. For individual items,
however, the percentage of trainees who answered each one correctly
is entered lower in this same column.

The values in column A have been given the caption of
"Difficulty", on the assumption that the easier items and subsets are
answered correctly more often.

The figures entered in column B of Table 16 contain similar
information about the Post-Test: the percent of items answered
correctly on the total test and item subsets; and for
individual items, the percent of times they were answered correctly.

The difference between the Pre-Test and Post-Test score levels
is entered in column C (Amount of Change). The column values
reflect the effectiveness of the training sequence, that is, how
much of the subject material has been transmitted to the trainees.

Column D (Direction of Change) is divided into four parts,
showing the direction of the change between Pre- and Post-Test.
It shows what % of total test subsets and individual items
remained correct or incorrect on both tests, which went from
incorrect to correct, and which went from correct to incorrect.
This tells something about the quality and consistency of the
test material.
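
For a single item, the four columns of Part I could be tallied along the following lines. This is a minimal sketch under stated assumptions: each trainee's Pre-Test and Post-Test response to the item is recorded as True (correct) or False (incorrect), and the function name `item_summary` and the dictionary keys are hypothetical labels, not the Manual's own.

```python
def item_summary(pre_correct, post_correct):
    """Summarize one item: columns A-D of Part I, as percentages of trainees."""
    n = len(pre_correct)

    def pct(count):
        return 100.0 * count / n

    direction = {
        "correct_on_both": 0,
        "incorrect_on_both": 0,
        "incorrect_to_correct": 0,
        "correct_to_incorrect": 0,
    }
    # Classify each trainee's Pre-Test/Post-Test response pair.
    for before, after in zip(pre_correct, post_correct):
        if before and after:
            direction["correct_on_both"] += 1
        elif not before and not after:
            direction["incorrect_on_both"] += 1
        elif after:
            direction["incorrect_to_correct"] += 1
        else:
            direction["correct_to_incorrect"] += 1

    pre_level = pct(sum(pre_correct))    # column A (Pre-Test Level)
    post_level = pct(sum(post_correct))  # column B (Post-Test Level)
    return {
        "A": pre_level,
        "B": post_level,
        "C": post_level - pre_level,     # column C (Amount of Change)
        "D": {key: pct(count) for key, count in direction.items()},
    }


# Four hypothetical trainees answering one item on both administrations.
summary = item_summary(
    pre_correct=[True, False, False, False],
    post_correct=[True, True, True, False],
)
print(summary)
```

The same tally, summed over all items in a subset or in the whole test, would give the Subset and "Total Subject Matter" rows.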



2. Part II (Table 16A) In this part, column A contains
the mean Pre-Test scores of the total trainee group and of
trainee subgroups, and the scores of the individual trainees
on the Pre-Test. This column represents how prepared the
trainees were at the beginning of training (i.e., how competent
they already were with the subject material to be covered).








The mean score of the total trainee group and trainee
subgroups, and individual trainee scores on the Post-Test, are
entered in column B. These figures show the amount of
competence displayed by trainees at the end of the course,
collectively and individually.

The differences between the scores on Pre- and Post-Tests are
given in column C of Part II. This shows the amount of
achievement the trainees displayed, that is, how much they learned
during the training.

The Amount and Level of Change. In Part II of the table,
column D shows the amount and level of the change from Pre-
to Post-Test scores, as represented by a score value derived
from the curves like those shown in Figure 15A. This
represents the amount of competence and achievement displayed by
trainees.

Note: Part I of the table can be repeated for trainee subgroups
and even for individual trainees, if such detail is deemed
useful. Similarly, Part II of the table can be repeated for
each subject subset, and for individual items.






PRESENTATION OF DATA: THE EVALUATION REPORT

When the entire Test/Retest procedure is completed, including
analysis, it will be necessary to gather all of the
information about it into some form of report, either for program
administrators, funding agencies or government departments,
or simply for the record. While a number of charts and tables
will have been generated during the course of designing and
implementing the instrument, it is best to summarize the data
from these sources in an easily accessible form. Such a
summary should state clearly what was learned and what
implications can be drawn from what happened.

It is recommended that a report should contain the following:

1. Description of the Test Instrument. This should be a
simple statement, describing the instrument by the number of
items it contained, how the items were grouped (if they were),
with any additional information that might be pertinent.

2. Description of the Training Group. Information on the
number of trainees and any special characteristics by which
they were grouped should be stated.

3. Dates of Applications. The dates should be noted,
together with a brief description of the circumstances where
necessary. If any radical differences between the two
administrations occurred, these should also be noted.

4. The Curriculum Audit. A brief description of the type of
audit conducted, concurrent or retrospective and, if the latter,
who conducted it, should be provided. The rating form employed
should be attached to the write-up to supplement the report.

5. Reporting Trainee Analysis Results. The results of the
assessment will be described in three ways.

A. Results for the total training group: Include percent
or number correct for Test and Retest, percent and direction
of movement from one administration to the next, and whether
that movement is statistically significant. The level of that
movement should also be noted.

B. Results for trainee subgroups: The same information
should be given here.

C. Results for individual trainees: If the trainee group
was small, these results could be listed here. In most cases,
however, it would only be necessary to refer the reader to the
appropriate table, and note here any trends.






D. Interpretation of trainee results: Based on the data
presented above, some preliminary interpretations should be
offered.

6. Reporting Item Group Analysis Results: Since the
instrument is designed to measure the effectiveness of the training
as well as the change in competence among trainees, the
results will be further broken down as follows:

A. Results of overall test: Number of items correct on
the Test and Retest should be listed, with percent of change
and whether or not that change is significant. Number of
items that were answered correctly on both applications,
incorrectly on both, and that went from correct to incorrect and
incorrect to correct should be listed.

B. Results for item subgroups: The same information
should be given for item subgroups.

C. Results for individual items: Unless the test was
extremely short, all the information for individual items need
not be reproduced here. However, specific trends should be
noted.

D. Interpretation: Again some interpretation should be
offered.

7. Level and Magnitude of Score Movement: This information
(as described on pp. 103-113) in combination with the results
of the application of tests of statistical significance
comprises the major data set from which inferences concerning
training effectiveness and trainee achievement will be drawn.
The data on score levels and magnitude should be presented in
the form of tables of the type shown in Figures 14 and 15A.
(Actually only one or the other set of tables should be
presented depending upon whether weighted or unweighted
Achievement/Competence (A+C) scores were computed.) Following the
guidelines and examples provided on pp. 104-112, tables
(displaying Test, Retest and A+C scores) with brief summaries
should be presented for



a) the total trainee group

b) the individual trainees

c) trainee subgroups (when available)

8. General Summary Statement and Recommendations: An overall
statement of the success, failure, or uncertain performance of
the instrument, with regard to its usefulness as an evaluation
of trainees and training, should be made.






9. Attachments: It is suggested that the following materials
be attached to the report where their reproduction would not
involve excessive cost or labor:

1. The Item Specification Table (see Figure 1)

2. The Curriculum Audit Results (see pp. 72-73)

3. The Data Analysis Summary Profile (see Figure 13)

4. The Unweighted (or Weighted) Achievement/Competence
(A+C) Score Tables describing the "Level and Magnitude
of Test/Retest Score Change" (see Figures 14, 15, and
15A)

5. The Data Analysis Summary Profile II (see Figures 16
and 16A)







CHAPTER VIII 
ASSESSING THE TEST INSTRUMENT 



An analysis of items in the form of an assessment of Test to
Retest response patterns has been suggested in order to
provide additional information concerning the relative
effectiveness of instruction for each subject area and for each item
within subject area (see pp. 89-91). This analysis also
attempted to identify those items which were relatively
ineffective for their intended purpose of assessing levels, and
changes in levels, of subject matter competence (pp. 91-98).
Thus, some data on the effectiveness of the test instrument itself
was provided by the item pattern analysis. However, a more
quantitative approach to assessment of test effectiveness can
be employed when staff-time permits. This involves an analysis
of the responses given by the examinee group to one
administration of a series of items.

ITEM ANALYSIS 

Three kinds of data are derived from the analysis of individual
items:

1) The difficulty level of an item (defined as the total
percentage of examinees getting the item correct).

2) The discriminating power of an item (defined by the
degree to which an item differentiates between high
and low scoring examinees).

3) The relative effectiveness of the item's distracters
(defined as the degree to which the examinees respond
to the item's incorrect alternatives).

Procedural Steps

For purposes of illustration, the coverage of item analysis
procedures will refer to the responses of 45 examinees to test
items assessing competence in the area of quantitative research
methods.

1. Rank the examinees from high to low according to total test
score.

2. Select out the upper one-third of the examinees (i.e., 15)
and the lower one-third.






3. For each individual test item, tabulate the number of
examinees in the upper and lower segments who selected
the correct response and each of the distracters. These
can be recorded on an item card designed specifically to
illustrate each item with its specific characteristics
(see Figure 17).

4. Compute the estimate of item difficulty. The difficulty
level of an item is computed by dividing the total number
of correct responses to that item (ΣC) by the total number
of examinees in both groups (N) and multiplying by 100.

$$D = \frac{\Sigma C}{N} \times 100$$

The item difficulty for the sample data (Figure 17) is:

$$D = \frac{15}{30} \times 100 = 50\%$$

Note: The values of D can range from 0 to 100%; the
larger the value the easier the item.

5. Compute the estimate of discriminatory power. The
discriminatory power of an achievement test item is computed
by subtracting the number of correct responses in the
lower group (C_L) from the number of correct responses in
the upper group (C_U), and dividing the result by the
number of examinees in the upper group (N_U).

$$DP = \frac{C_U - C_L}{N_U}$$

For the sample data, the estimate of discriminatory power
is:

$$DP = \frac{12 - 3}{15} = .60$$

6. Assess the effectiveness of item distracters. A
distracter is considered effective if more examinees in the lower
than in the upper group select it as the correct answer.
The effectiveness of each distracter of an item can be
determined simply by observation of response frequencies
for an item. (An example is provided below.)
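
Steps 4 through 6 can be sketched as a short Python routine. The sketch assumes the response counts for the upper and lower thirds have already been tabulated (step 3) and uses the sample counts from Figure 17; the function name `item_analysis` and its argument names are hypothetical.

```python
def item_analysis(upper, lower, correct):
    """Compute difficulty, discriminatory power, and distracter checks.

    upper, lower: dicts mapping each alternative to the number of examinees
    in the upper / lower group who selected it.
    correct: the key of the correct alternative.
    """
    n_upper = sum(upper.values())
    n_lower = sum(lower.values())

    # Step 4: percent of examinees in both groups answering correctly.
    total_correct = upper[correct] + lower[correct]
    difficulty = 100.0 * total_correct / (n_upper + n_lower)

    # Step 5: (upper correct - lower correct) / size of the upper group.
    discriminatory_power = (upper[correct] - lower[correct]) / n_upper

    # Step 6: a distracter is effective if it attracts more examinees
    # from the lower group than from the upper group.
    distracters = {
        alt: lower[alt] > upper[alt] for alt in upper if alt != correct
    }
    return difficulty, discriminatory_power, distracters


# Sample counts for the item in Figure 17 (alternative "b" is correct).
upper_third = {"a": 0, "b": 12, "c": 0, "d": 3}
lower_third = {"a": 0, "b": 3, "c": 12, "d": 0}

difficulty, dp, effective = item_analysis(upper_third, lower_third, "b")
print(difficulty, round(dp, 2), effective)
```

With these counts the routine reports a difficulty of 50%, a discriminatory power of .60, and only distracter c as effective, which matches the values recorded on the sample item card.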

Interpreting the Item Analysis Data 

The general effectiveness of individual items will be judged 
on the basi^^, of an assessment of all of the item characteristics 




131 



123 



FIGURE 17

SAMPLE CARD FORMAT WITH ITEM CHARACTERISTICS

Quantitative Research Methods

Subject Matter: Demographic Analysis
Behavioral Outcome: Ability to Calculate

ITEM: Approximately how long does it take for a population
      to double if the annual growth rate is equal to
      .03 (3%)?

      a. 15 years
      b. 23 years
      c. 31 years
      d. none of the above

ITEM ANALYSIS RESULTS (12/73)

                                               no      multiple
Alternatives            a     b     c     d  response  response

Upper 1/3 examinees     0    12     0     3     0         0
Lower 1/3 examinees     0     3    12     0     0         0

Difficulty level: 50%
Discrimination Index: .60

(comments:)






provided by item analysis. How this data is employed in rating
an item's effectiveness for its intended purpose can be
illustrated with the results of the sample analysis for the
item in Figure 17.

The item is in the middle difficulty range as
evidenced by the fact that 50% of the examinees
(15 out of 30) got the item correct. The item
discriminates in a positive direction as shown
by the fact that 12 out of 15 examinees in the upper
group got the item correct while only 3 out of
15 in the lower group did so. The discrimination
index is quite high (.60), indicating that
for an item of 50% difficulty, it is functioning
effectively since it distinguishes adequately
between the high and low competence groups.

There is, however, wide variation in the
effectiveness of the item's distracters. Distracter
a is completely ineffective since it attracted
no examinees from either the upper or lower
group. Distracter c is functioning at maximum
effectiveness since it attracted no examinees from
the upper group and almost all the examinees in
the low group. Distracter d is quite ineffective
in that it attracted more examinees from the
upper than lower competence group. In spite
of the ineffectiveness of two of the distracters
the discriminatory power of the item is rather
high. If so desired, the discriminatory power
could probably be increased by replacing
alternatives a and d with more effective distracters.
(An examination by the staff of why these
alternatives were poor distracters should facilitate
the design of more effective replacements.)

The type of item described above would be most appropriate for
inclusion into the type of test instrument proposed for this
Manual. It is quite effective in distinguishing between high
and low competence examinees (the criterion for competence
being total test score) and it is in the medium difficulty
range.*


* Many experts in the field of educational measurement
recommend that for maximum utility, the majority of the items
comprising the test should be in the middle range of difficulty,
with some easy items (placed near the beginning of the
test) to encourage low ability examinees and some more
difficult items to provide a challenge for the more
competent examinee (14).






based on the assumption that a sequence of training presenting
some, if not all, of the same subject matter will be conducted
again in the future. Each item card would be titled by both
the subject area and behavioral outcome assessed by that item.
In this way, an item card can be cross-referenced and pulled
from the file by either content or behavior. (The format for
such a card is shown in Figure 17.)
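One way to picture such a cross-referenced file is as a small in-memory index. The sketch below is purely illustrative: the class names `ItemCard` and `ItemFile`, the field names, and the method names are all assumptions, and the Manual itself describes a file of physical cards.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemCard:
    subject_area: str          # e.g. "Demographic Analysis"
    behavioral_outcome: str    # e.g. "Ability to Calculate"
    stem: str
    difficulty_pct: float      # item difficulty D
    discrimination: float      # discriminatory power DP


class ItemFile:
    """Item cards indexed by subject area and by behavioral outcome."""

    def __init__(self) -> None:
        self._by_subject = defaultdict(list)
        self._by_outcome = defaultdict(list)

    def add(self, card: ItemCard) -> None:
        self._by_subject[card.subject_area].append(card)
        self._by_outcome[card.behavioral_outcome].append(card)

    def pull_by_subject(self, subject_area: str) -> list:
        return list(self._by_subject.get(subject_area, []))

    def pull_by_outcome(self, outcome: str) -> list:
        return list(self._by_outcome.get(outcome, []))


card = ItemCard(
    subject_area="Demographic Analysis",
    behavioral_outcome="Ability to Calculate",
    stem="Approximately how long does it take for a population to "
         "double if the annual growth rate is equal to .03 (3%)?",
    difficulty_pct=50.0,
    discrimination=0.60,
)
item_file = ItemFile()
item_file.add(card)
print(len(item_file.pull_by_outcome("Ability to Calculate")))
```

Keeping two indexes mirrors the idea that the same card can be pulled either by content or by behavior.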

When the next training sequence is to be constructed, the
appropriate items can be pulled from the file and compiled
into a test instrument. Any changes in course content can be
accounted for by constructing new items to cover the new
material. The new items would then be analyzed in the later item
analysis and either added to the item file or discarded.

If it can be assumed that some, or all, of the training
material is to be repeated at a future date, then the staff time
and effort invested in a post-instruction item analysis will
be well spent. Valid and effective items, especially those
measuring the more complex behavioral outcomes, are usually
quite difficult to construct and are time-consuming. Thus,
items shown to be highly effective in assessing competence
should be filed away and drawn out when constructing the next
test instrument. The large amount of time and effort saved in
not having to construct all new items as well as the high
quality of items available for test inclusion will more than
compensate for the time and effort expended in conducting a
comprehensive item analysis.

Doing the item analysis at this time also allows for the
incorporation of data on changes in item responses from Pre- to
Post-Test to determine which items should be analyzed. Using
the table of item response patterns (see Figure 12), it can be
seen at a glance which items were most frequently answered
correctly and incorrectly on the two applications of the test.
Those with a high percentage of C → I would be subjected to
scrutiny and probably discarded. Items with a high percentage
of I → I should be subjected to analysis to see which
alternate incorrect answer is being most frequently chosen (on
both applications). Items with a high percentage of I → C
would clearly be good items, as would items with a high
percentage of C → C (although in this case, if there are a great
many items in this category it would probably be a good idea
to eliminate some of them as making the test too easy).
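The four response patterns can be tabulated for a single item with a short routine such as the one below. It is a sketch only: the function name is an assumption, the five trainee responses are invented for illustration, and it is not the Manual's ITEMPAT program.

```python
from collections import Counter

PATTERNS = {
    (True, True): "C->C",
    (True, False): "C->I",
    (False, True): "I->C",
    (False, False): "I->I",
}


def response_pattern_percentages(pre_correct, post_correct):
    """Return the percentage of trainees showing each Pre- to
    Post-Test pattern for one item (C = correct, I = incorrect)."""
    if len(pre_correct) != len(post_correct) or not pre_correct:
        raise ValueError("need two equal-length, non-empty sequences")
    counts = Counter(
        PATTERNS[(bool(pre), bool(post))]
        for pre, post in zip(pre_correct, post_correct)
    )
    n = len(pre_correct)
    return {label: 100 * counts[label] / n for label in PATTERNS.values()}


# Hypothetical responses of five trainees to one item.
pre = [False, False, True, False, True]
post = [True, True, True, False, False]
percentages = response_pattern_percentages(pre, post)
print(percentages)
```

An item whose C->I percentage is high would be a candidate for scrutiny under the reasoning above.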

2. During Test Construction. If staff time permits and an
appropriate examinee group can be assembled, the test items
(constructed according to the guidelines provided) can be
given a trial administration employing the same testing format






that will be used when assessing the trainee group. The
responses would then be subjected to item analysis and the
appropriate item characteristics computed. Those items judged
most effective for purposes of assessing levels of competence
(based on the item and validity criteria presented in Chapter
II) would be selected for the final instrument. Items suspected
of being ambiguous or of containing technical defects can
either be revised or discarded and replaced by new items.
Replacement items (either from the item reserve or newly
constructed) would assess the same cognitive behaviors and
subject matter originally assessed by the discarded items.
The structure and content of defective items should be
compared with the structure and content of their replacements
in order to avoid including new items with deficiencies
similar to those found in the items they are to replace.

While item analysis data will greatly enhance the test
construction process, the opportunity for conducting a tryout
of items depends heavily upon the availability of a sample of
examinees that is representative (in terms of education level,
professional background, etc.) of the population of
individuals who will comprise the trainee group. If such a group
can be assembled (e.g., possibly with individuals from a
training program going on during the period of test construction)
then it is strongly recommended that an item tryout be
conducted.

When a preliminary administration of the items is not possible,
the procedures for test construction outlined in the text
(pp. 8-31, and pp. 35-40) and in Appendix C, if closely
followed, should be quite adequate for developing a valid (i.e.,
appropriate, fair and representative) test instrument for the
assessment of changes in levels of subject matter competence.



In addition to item analysis data, a tryout of items will
provide information concerning such factors as the amount
of time required to administer the test of X number of
items and the appropriateness and adequacy of test
instructions and format.

Every attempt should be made to correct and revise items
with suspected deficiencies; discard them only when it is
not possible to upgrade them. This is strongly suggested
since replacement items will not have been given a tryout
and will be of unknown difficulty and discriminating
ability and, therefore, of questionable effectiveness.






APPENDICES

APPENDIX A: History of the Development of the Methodology
    Background
    Field Applications

APPENDIX B: Psychometric Theory Underlying the Methodology
    General Considerations
    Testing Design
    Factors Affecting Measurement Validity:
      Internal & Extraneous Factors
    Practical Considerations for Selecting the Design

APPENDIX C: Guidelines & Rules for Constructing
            Specific Types of Objective Items,
            with Examples
    Multiple Choice Items
    Interpretive Exercises

APPENDIX D: Model Set of Test Instructions

APPENDIX E: FORTRAN IV Computer Programs with
            Complete Documentation
    Program COMSCOR
    Program TDEKMERG
    Program TRELVAR
    Program CHISQUARE
    Program ITEMPAT

APPENDIX F: Statistical Formulas, Computations &
            Tests of Significance
    Testing Significance of Difference by Application
      of the t-Test for Related Variables
    Testing Significance of Difference by Application
      of the Chi-Square Test of Independence
    Quantitative Procedures for Constructing Item
      Pattern Analysis Tables

APPENDIX G: Mathematical Equation and Parameter
            Values for Generating the Weighted
            Achievement/Competence Score Curves






FIGURES IN THE APPENDICES

C1  Mean Number of Children by Religiosity and
    Level of Education

F1  t-Test Worksheet

F2  Table of Values of t at the 5% and 1% Levels
    of Significance

F3  Sample Worksheet for Item Pattern Analysis

F4  Item Response Patterns by Individual Item

F5  Item Response Patterns by Individual Trainee






APPENDIX A 



HISTORY OF THE DEVELOPMENT OF THE METHODOLOGY



Background 

The assessment approach described in this Manual is not new
to the field of educational evaluation. The measurement and
statistical testing of score differences based on two
administrations of the same test, as an indicator of the degree of
learning that has occurred over time, is a common approach
used by instructors in academic situations. The Test/Retest
paradigm for assessing achievement probably dates back to
the beginning of the formal psychometric testing movement or
even before. While the evaluation of educational achievement
through the application of objective-type test instruments
has been widely discussed in a number of excellent
textbooks on educational measurement and assessment (see
Bibliography), no step-by-step, Manual-type guide has been
available where a training administrator who would like to
employ such methods could find the necessary information on
the planning, construction and administration of the test
instrument together with detailed statistical procedures for
the analysis and interpretation of the data that results.

The assessment procedures were originally developed in
answer to a request to the Division of Social and Administrative
Sciences from the Demographic Association of El Salvador.
The Division was asked to assist in evaluating a series of
training programs in terms of their effectiveness in reaching
a set of pre-defined objectives. The Association was
conducting four types of population/family planning/human
reproduction training programs, each directed toward a different
professional and paraprofessional level. One major objective
of the programs was that the participants acquire a
comprehensive understanding of new subject material, as well as
the abilities to apply this new learning to new
problem-solving situations. A major focus of evaluation was,
therefore, to determine the degree of relevant learning that
occurred during the course of training. It was felt that
while this would not be considered a comprehensive
evaluation, it would provide a measure of the degree to which the
training programs were accomplishing some basic, short-term
objectives.






Since the assessment of learning was to involve the
measurement of change in levels of substantive knowledge as a
function of an intervening educational experience, a baseline
level of competence from which to measure change was
required. The application of an objective-type achievement
instrument administered under a Test/Retest design was
selected as the most appropriate approach. Rather than
construct the test instrument at the Institute or send staff
members to El Salvador to conduct the evaluation, it was
decided that it would be more appropriate to develop a set
of guidelines for structure and content together with an
outline of the statistical procedures required for score
analysis, to be used by the program administrators
themselves in constructing the test instrument and conducting
their own assessment study.

Field Applications 

Based upon secondary information feedback from the Salvador 
training experience, the original guidelines for test con- 
struction, administration and analysis were revised, expanded 
and compiled into a draft Manual of procedures for assessing 
the acquisition and application of new learning derived from 
a structured training experience.* 

The methodology described, while theoretically and intuitive- 
ly sound, had not been subjected to controlled field-testing. 
The lack of first hand field experience left questions con- 
cerning the methodology's utility and validity unanswered. 
It was felt that several field applications of the method- 
ology under varying training conditions would be required. 



* The original guidelines focussed upon changes in levels of
subject knowledge from Test to Retest as the measure of
program impact and trainee achievement. It was later
realized that assessment of knowledge alone was too limited an
area of evaluation since it primarily involves tasks which
emphasize remembering, either through recall or recognition (1).
The focus was, therefore, enlarged to encompass a greater
range of cognitive behaviors.







The first field application was conducted at the invitation
of the United States Agency for International Development
(AID), in a training situation involving a Government
Sponsored Population/Family Planning Program Seminar/Workshop
in Washington, D.C. The field testing was carried out from
September 1972 to January 1973.

A second field testing was carried out at the request of the
National School of Public Health, Department of Health and
Family Protection, Rennes, France, which was planning a seven
week training program for French health workers at various
professional levels. The field work was conducted from
October 1973 to January 1974. The numeric data provided in
the section on statistical analysis of response data
(including the data in APPENDIX F) were derived from this field
testing.

The third field study was also carried out at the National
School of Public Health in Rennes. This study involved
assessment of changes in levels of competence among health
professionals from Francophone Africa who were participants
in a four month Family Planning and Maternal/Child Health
Training Program being conducted under the sponsorship of
the Department of Health and Family Protection. The study
was conducted from March to October of 1974.

The content of the Manual, while derived primarily from the
field testing experiences and preparation of the guidelines
paper, drew heavily from the writings of various educational
specialists whose major works are cited in the Bibliography.
Thus, the methodology described is not so much an innovative
contribution to the field of educational assessment as it is
a comprehensive synthesis of extant experimental design,
qualitative and quantitative guidelines for test instrument
planning, construction and administration and statistical
analytic techniques into a self-contained reference text for
conducting a study to assess changes in levels of subject
matter competence as a result of participation in a
structured training experience.






APPENDIX B 

PSYCHOMETRIC THEORY UNDERLYING THE METHODOLOGY 



The purpose here is not to inundate the reader with an
exhaustive exposition on the complexities of psychometric
theory as it relates to achievement testing. The body of
literature concerned with this area is so voluminous as to
preclude all but the most elementary and non-technical
discussion. The assessment guidelines and methodology
comprising this Manual should not be accepted, however, without
some understanding of the theory governing their effective
use. The objective here, then, is to discuss some of the
underlying theoretical principles involved as well as to
identify the major problems encountered in measuring learning
outcomes.*

General Considerations 

Measuring educational achievement requires an objective
assessment of what a group of students has learned (i.e., their
subject matter competence), in one or more relevant subject
areas, through a testing procedure employing a set of subject
related tasks. The testing procedure must be structured so
that all examinees interpret the tasks in a similar way (to
provide a common basis for assessment), and standardized so
that the tasks and procedures for administration and scoring
are explicit and fixed (to ensure that the same test
procedures are followed each time an assessment is conducted).
In order that the procedures conform to an achievement test
model, the subject material comprising the tasks should be a
representative sampling of the significant subject matter
dealt with during the course of instruction. If the content



* Procedures for translating "Subject Matter Competence" (the
specific learning outcome under study) into operational
indices of achievement amenable to objective assessment are
discussed on pp. 7-24.






of the tasks adequately reflects the relevant subject
content of the course work, then measures of success or failure
in dealing with these tasks (when administered under
controlled testing conditions) will provide the basis for
inferences concerning

(a) the effectiveness of the instructional sequence
    in achieving a specific training program
    objective; and

(b) the magnitude of the change in the trainees'
    levels of subject matter competence.

The use of the same test results to assess both training
effectiveness and trainee achievement is not new to the
field of educational evaluation. According to Cronbach (2),
every time a teacher gives a test he is testing his
instruction as much as he is testing the student's efforts
and achievements.

Testing Design

In order to relate any change in an individual's level of
subject competence directly to a specific training sequence,
a testing design is required that will provide a measure of
the trainee's level of competence prior to the introduction
of instruction (i.e., a quantitative assessment of the
degree to which a trainee has already acquired what is to
be learned). This pre-instructional baseline level of
competence is subsequently compared with a similar measure
obtained upon the completion of instruction. A statistical
analysis of any increases in competence levels from testing
to retesting will help determine whether such increases are
significant or simply due to chance. The degree of
confidence with which inferences can be made which relate the
increases in competence to the direct effects of training
instruction will depend upon the type of test instrument
administration design selected.
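A minimal sketch of the kind of statistical comparison meant here is the t statistic for related (paired) scores, the test named in Appendix F. The function name is an assumption and the five pairs of scores are invented for illustration; they are not data from the field tests.

```python
import math
import statistics


def paired_t(pre_scores, post_scores):
    """Return (t, degrees_of_freedom) for Pre-/Post-Test scores
    obtained from the same trainees."""
    if len(pre_scores) != len(post_scores) or len(pre_scores) < 2:
        raise ValueError("need at least two matched pairs of scores")
    diffs = [post - pre for pre, post in zip(pre_scores, post_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for five trainees.
pre = [10, 12, 9, 14, 11]
post = [15, 16, 12, 18, 13]
t_value, df = paired_t(pre, post)
print(f"t = {t_value:.2f} with {df} degrees of freedom")
```

The resulting t would then be compared with the tabled value for the chosen significance level and degrees of freedom.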

The testing design employed in the Manual is the One-Group
Pretest-Posttest Design (3) which is represented graphically
by:






          O1   X   O2

Where:  X  = the introduction of a
             treatment variable whose
             effect is to be measured;

        O1 = a measurement procedure
             conducted prior to the
             introduction of the
             treatment variable; and

        O2 = a measurement procedure
             conducted following the
             application of the treatment
             variable.

In terms of the level of assessment being proposed here,
the "X" represents the structured training sequence (i.e.,
an educational treatment); "O1" represents the
pre-instruction and "O2" the post-instruction test performance
with tasks sampling cognitive competence. Operationally, then,
assessment at this level is essentially a statistical
determination of the degree to which training instruction
elevates the trainee's initial baseline level of subject
competence. This implies that a change will occur as a
result of instruction and that the magnitude of the change
can be measured quantitatively (and related directly to
that instruction).

Factors Affecting Measurement Validity 

There are a number of factors* related to the technical/ 
structural aspects of the test instrument tha€ must be 
acted upon, due to their potential confounding effects on 
the measurement outcome (to the extent that th6 test results 
can be rendered invalid for tl^eir intended use in measuring 
changes in levels of sxibject matter competence and relating 
the changes to the impact of instruction) . 



* These factors are covered in greater depth in Chapter II.






1. Test item construction. Items must be constructed so
that an incorrect response means the examinee
has not achieved competence in the subject area sampled
by the item, and not that the vocabulary was vague or
too difficult or the sentence structure too complex.

2. Item content validity. Inferences concerning subject
matter competence cannot be based on items that provide
an inadequate (non-representative) sample of the subject
areas and abilities covered in the instructional
sequence.

3. Levels of item difficulty. Test performance is highly
sensitive to and strongly influenced by items which
are too easy or too difficult.

4. Test directions and statements of test purpose. These
affect test performance by shaping the examinee's conception
(and perception) of the task and by influencing his
level of test-taking motivation.

5. Time limits and guessing penalties. Individual
differences in non-cognitive functions (not directly related
to the test behavior being measured) may enter into the
assessment when limits and penalties are imposed.

The above list, while not exhaustive, calls attention to the 
fact that without proper controls test performance is vul- 
nerable to the subtle and profound influences of factors 
above and beyond those which the test purports to measure. 

Extraneous Variations in Test Performance 

In an effort to identify potential sources of extraneous in- 
fluence, it is necessary to consider the effects of other 
variables beyond those associated with the technical/struc- 
tural nature of the instrument itself, which may pose a 
threat to the internal validity of the assessment. (Internal 
validity refers to the level of confidence which can be
ascribed to findings which infer a causal and direct
relationship between the sequence of instruction and the level of
subject competence.) These variables can function as
"plausible rival hypotheses," offering alternative
explanations for the O1 to O2 difference (i.e., Pre- to Post-Test







score increases), rival to the inference that "X" (i.e.,
training) causes the difference (4).

Awareness of the fact that such variables can produce effects
confounded with the effects of the training sequence is
particularly important given the nature of the One-Group
Pretest-Posttest design employed in the assessment. This design,
like many employed in educational evaluation, is a
quasi-experimental design and, unlike the true experimental design*,
is employed "in situ" where necessary controls cannot always
be implemented. Further, the practical necessities of
training program operations most often preclude the use of a
control group (i.e., a group that receives both administrations
of the test without the intervening educational treatment)
against which to measure the significance of change occurring
in the training group.

Campbell and Stanley (5), in presenting this design, discuss a
number of threats to valid inference. The following is a
list of these factors, together with a judgment as to their
potential effect on the type of assessment being proposed
here.

a. History. Between the two measurement points (i.e.,
            O1 & O2) other change-producing events may
            have occurred in addition to the
            educational treatment variable "X".

   While extraneous outside influences can produce changes
   in Test/Retest measurements of certain variables (e.g.,
   attitudes and opinions), their effect on subject matter
   competence would be minimal. (Any activity occurring
   outside of the formal training sessions, such as
   homework assignments and informal student discussion of
   course-related topics, is an integral part of training
   and not an extraneous variable.)



* A highly structured laboratory-type situation where random
assignment of subjects to treatment groups as well as other
types of controls are employed to reduce or eliminate the
effects of variables other than those being measured.







Practical Considerations for Selecting the Design 

It can be concluded from the above discussion that valid
inferences concerning the effects of short-term instruction
on subject competence can be drawn from assessment
procedures employing the One-Group Pretest-Posttest design. It
must be admitted, however, that the selection of a
quasi-experimental design was based more on necessity than on
choice of the most valid design for assessment purposes.

The most valid approach would be to conduct the assessment
under conditions representative of true experimental design.
That is, individuals would be randomly assigned to one or
the other of two groups (i.e., an experimental group to
receive instruction and a control group receiving no
instruction). Both groups would receive the Pre-Test and Post-Test
and the changes that occurred within each group would be
compared. The score changes occurring within the control
group (reflecting effects on scores that operate in the
absence of training) would be statistically partialled out
of the score changes in the experimental group and the
resulting difference would be attributed to the training
sequence. A causal relationship between instruction and
significant (experimental group) post-score increases can
be inferred with a high degree of confidence since the true
experimental design can be considered as actively controlling
the extraneous effects of history, maturation, testing,
instrumentation, etc. The difference for the experimental
group between Pre-Test and Post-Test cannot be explained by
main effects of these variables as they are found to affect
both the experimental and control groups (7); therefore, the
change is attributed to the effects of training.
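The logic of the control-group comparison can be illustrated with a deliberately simplified sketch in which the mean gain of the control group is subtracted from the mean gain of the experimental group. This simple subtraction is an assumption standing in for whatever statistical adjustment an evaluator would actually apply, and all names and scores below are hypothetical.

```python
from statistics import mean


def mean_gain(pre_scores, post_scores):
    """Average Pre- to Post-Test score change within one group."""
    if len(pre_scores) != len(post_scores) or not pre_scores:
        raise ValueError("need two equal-length, non-empty sequences")
    return mean(post - pre for pre, post in zip(pre_scores, post_scores))


def net_training_effect(exp_pre, exp_post, ctl_pre, ctl_post):
    """Experimental-group gain minus control-group gain."""
    return mean_gain(exp_pre, exp_post) - mean_gain(ctl_pre, ctl_post)


# Hypothetical scores: three trainees and three control subjects.
effect = net_training_effect(
    exp_pre=[10, 12, 9], exp_post=[15, 16, 12],
    ctl_pre=[10, 11, 9], ctl_post=[11, 12, 9],
)
print(f"net gain attributable to training: {effect:.2f} points")
```

Whatever gain appears in the control group reflects influences that operate without training, which is why it is removed before the remainder is credited to instruction.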

One major working assumption underlies the incorporation of
the One-Group Pretest-Posttest design into the assessment
methodology comprising this Manual. This assumption is
that the type of training situation where the assessment
methodology will most often be employed is one in which the
only individuals available for testing are the participants
themselves.

Many technical as well as practical considerations preclude
the implementation of rigid controls and the use of a student
"control group" in most educational settings. This is
especially true in training situations where a sponsoring agency
conducts a program involving "non-resident" participants.






The training programs involved in the field testing of the
assessment methodology are cases in point.

The Government Agency-sponsored Population/Family Planning
Training Program conducted in Washington, D.C. was attended
by health personnel from a number of developing countries
throughout the world. Since these individuals were in the
country specifically to participate in the training, it would
not have been appropriate to divide the group randomly into
two subgroups with one to receive training and the other to
serve as control. Nor was it possible to secure an
independent group of subjects, matched with the trainee group on
relevant parameters (e.g., educational level, professional
background, English language proficiency, etc.) to serve as
the control.

The lack of appropriate individuals to assemble into a
comparable control group was also evidenced in the field testing
involving both the French and Francophone African Training
Programs in Health and Family Protection conducted at the
National School of Public Health in Rennes, France.

The test results obtained under a quasi-experimental design
can be used to assess training effectiveness if the evaluator
is willing to accept certain assumptions about what would
have happened to the variable being assessed if the
individuals had not been exposed to the sequence of instruction.
Essentially, the evaluator assumes that the observed changes
were due to the impact of the educational program and that
the changes would not have occurred if the trainees had not
been exposed to the program (8). For example, in the first
Rennes Training Program, where the assessment results display
significant increases in levels of competence in the three
major subject matter areas, it is an appropriate assumption
that the trainees would not have shown such changes in a
comparable period of time if they had not participated in the
course of training.

The inferential power of the assessment results is greatly
enhanced provided that systematic guidelines in test
construction and administration are implemented and appropriate
statistical tests and procedures are employed in the analysis
of resulting test data.






APPENDIX C 



GUIDELINES AND RULES FOR CONSTRUCTION OF SPECIFIC FORMS
OF OBJECTIVE TEST ITEMS WITH EXAMPLES (9)



Constructing Multiple-Choice Items

The standard multiple-choice item consists of a stem and a
set of alternatives or response options. The stem can take
the form of either a complete question or an incomplete
statement while the alternatives provide possible answers or
completions of the statement. (The alternatives will consist of
one correct or best response together with two or more
misleading options, called distracters.) The following rules,
guidelines and suggestions are based on this standard design.

1. A definite problem should be recognized from the item
stem. The test taker should be able to tell, from reading
the stem of the item, what kind of competence he is expected
to demonstrate in answering. An item with an incomplete idea
in the stem, meaningless in itself, will be confusing and
take more time to figure out. An example:

Developing countries:

a. rarely formulate population policy.

b. have strong conservative elements operating
   against the adaptation of family planning.

c. must develop population policy in order to
   set goals and mobilize resources.

d. are among the most interesting places in the
   world.

Here the test taker is forced to read each response before
knowing what information is being looked for. Incomplete
ideas in the stem generally make it necessary to write lengthy
alternatives, and the alternatives will frequently cover a
number of unrelated ideas. It is best to include as much as
possible in the stem, to ensure uniformity in the alternatives
and reduce reading time. The example would be better if worded
as follows:






National population policy is important for developing
countries because it will:

a. have a direct effect on the size of the population.

b. set goals for allocation and mobilization of
   resources.

c. put them in the company of the advanced nations.

d. make them more attractive to visitors.

Here the stem of the item meets a criterion which serves as a
useful check on adequacy of initial problem statement: it
could be used as a short-answer type item, as "Why is national
population policy important for developing countries?"

2. Avoid having to repeat words in each alternative. If such
words are included in the item stem, the clarity of the item
will be increased and reading time decreased. Thus the item
that follows:

A limitation of teaching by external rewards is that: 

a. punishment is more effective than external rewards. 

b. many students will not be influenced by external 
rewards . 

c. the learner's behavior may not change as a result 
of external rewards . 

d. external rewards may become more important than the 
act itself. 

might be better worded as below:

One of the disadvantages of the use of external rewards 
in teaching is that external rewards are likely to: 

a. be less effective than punishment. 

b. influence only a few of the students. 

c. change the learner's behavior.

d. become more important than the learning itself. 

3. Avoid negative statements in stems and responses. Unless
significant learning outcomes require them, negatives (i.e.,
no, not, least) are best avoided because they are easily
overlooked. While test takers are expected to read items and
responses carefully, it is unfair to penalize someone for so
obvious an oversight. Also, the learning outcomes should






stress the acquisition of and the ability to use and apply
the best or most important methods, principles, facts, theories
and not the ability to select the "exceptions to the rule" as
measured by the typical "negative" item. If for some reason
you must use a negative, underline it: e.g., Which one of the
following is _not_ a type of oral contraceptive?

4. Use novel material and situations in formulating problems
that aim to measure understanding of or ability to apply
principles. With familiar material, as in the case of items
taken verbatim from books or lectures, you may end up measuring
ability to recognize or remember material (rote memory), rather
than ability to use what was learned. Of course, new material
must be carefully selected; it should not require knowledge
and/or understanding of areas not covered in the course. While
the situation must be new to the examinee, try to select
material as close to the illustrations used during the course
as possible.

5. Be sure no unintentional clues to the correct answer have
been written into the item stem. There are many ways in which
clues can slip in. Some examples follow:

A) In family planning education programs built around
the availability of transistor radios in particular
rural areas, one key element should be:

a. scheduling of programs when particular audiences
are likely to be listening.

b. talks by university professors.

c. scheduling programs when children are asleep.

d. standardizing the message for all parts of the
country.

Here the clue is the word "particular," which appears both in
the stem and the correct response. The test taker will be
likely to see this association and pick correctly. The best
way to deal with this example would be to take "particular"
out of the stem, where it doesn't add anything to the meaning
anyway.

B) The Ministry of Health has commonly been selected
as the principal organization to run population
programs because:






a. it usually has responsibility for major
activities concerned with population growth.

b. it is always well financed.

c. its clinics can provide services needed for
a population program.

d. the medical profession has never failed to
initiate and operate new programs effectively.

Here the clue is in the use of the words "usually," "always,"
and "never" in three of the responses. Answer c is the only
unambiguous alternative and thus most likely to be chosen.
Ambiguous terms such as these should be avoided in any case,
but using them in some responses and not in others will often
give the answer away.

C) The net reproduction rate measures an:

a. annual increase of births over deaths.

b. annual rate at which women are replacing
themselves on the basis of prevailing
fertility, assuming no migration.

c. decennial growth rate of the population.

d. per generation growth rate.

The article "an" can only go with the two alternatives that
begin with vowels (a & b), thus reducing the choice to two
alternatives. Items should be read over carefully for
grammatical matters, particularly for grammatical agreement
between the item stem and all the responses.

In addition, the item above gives a further clue in the great
difference in length between the correct response and the
other alternatives. (Since correct responses usually require
qualifications, they tend to be longer than the distracters.)
Be sure that you don't give away the answer by trying to
squeeze in all the information needed to make it correct,
unless you lengthen the other alternatives as well.

D) When demographers refer to the "population pyramid,"
what are they referring to?

a. A mathematical formula for predicting population
trends.

b. A pictorial representation of the distribution
of the population by sex.






c. The hierarchy of the staff of a population/
family planning agency.

The answer is partly given away by reference to a "pictorial
representation," which easily refers back to the "pyramid" in
the stem. To help make this item less easy to guess, either
the phrase "pictorial representation" could be taken out, or
the item could be reworded as follows:

In demography, the "population pyramid" is a
pictorial representation of:

a. a mathematical formula for predicting
population trends.

b. the distribution of the population by sex.

c. the hierarchy of a population/family planning
program.
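The clues illustrated in examples A through C lend themselves to a rough mechanical screen before items go to review. The following is a minimal sketch in Python, not a procedure from this Manual: the function names, the 7-letter word threshold, and the 1.5x length ratio are all assumptions chosen for illustration, and a flagged item still needs a reviewer's judgment.

```python
import re

# The three qualifiers singled out in example B; extend the set as needed.
QUALIFIERS = {"always", "never", "usually"}


def words(text):
    """Lower-case words of a stem or response, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def find_clues(stem, options, correct_index, min_word_length=7):
    """Return warnings about possible unintentional clues in one item.

    min_word_length and the 1.5x length ratio are arbitrary assumptions.
    """
    warnings = []
    correct = options[correct_index]
    distracters = [o for i, o in enumerate(options) if i != correct_index]

    # Example A: a longer stem word echoed only by the correct response.
    echoed = {w for w in words(stem) & words(correct)
              if len(w) >= min_word_length}
    for option in distracters:
        echoed -= words(option)
    if echoed:
        warnings.append("stem word echoed only in the correct response: "
                        + ", ".join(sorted(echoed)))

    # Example B: qualifiers present in some responses but not in others.
    has_qualifier = [bool(words(o) & QUALIFIERS) for o in options]
    if any(has_qualifier) and not all(has_qualifier):
        warnings.append("qualifiers appear in only some responses")

    # Example C: stem ends in "an" but a response starts with a consonant.
    if re.search(r"\ban:?\s*$", stem.lower()):
        if any(o.strip()[:1].lower() not in "aeiou" for o in options):
            warnings.append("article 'an' does not agree with every response")

    # Example C, continued: correct response far longer than the distracters.
    mean_length = sum(len(o) for o in distracters) / len(distracters)
    if len(correct) > 1.5 * mean_length:
        warnings.append("correct response is much longer than the distracters")
    return warnings


item_c = find_clues(
    "The net reproduction rate measures an:",
    ["annual increase of births over deaths.",
     "annual rate at which women are replacing themselves on the basis "
     "of prevailing fertility, assuming no migration.",
     "decennial growth rate of the population.",
     "per generation growth rate."],
    correct_index=1,
)
print(item_c)
```

Run on example C, the sketch reports both the article disagreement and the length difference discussed above.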

6. Avoid responses that overlap or include each other. In
the example below, answers b and d include answers a and c:

An average annual growth rate of 2.8% leads to a
doubling of the population in:

a. under 15 years.

b. under 25 years.

c. over 50 years.

d. over 100 years.

If the answer were, for example, 7 years, both a and b would
be correct, and the chances of guessing correctly would be
improved.

7. Do not use a pair of opposite statements as alternatives
if one of the pair is correct. Most test takers will limit
their choice to one of the two opposing statements, thus
reducing a four-choice item to a two-choice item, as in the
example:

The doubling of the population expected in the next
4h years is likely to have what effect on the growth 
rate of total income? 

a. The productivity of investment will increase.

b. There will be no change. 






c. The rate of savings will increase. 

d. The rate of savings will decrease. 

This problem can be avoided by employing two pairs of oppo- 
sites or eliminating the use of opposites altogether. 

8. Use the alternative "none of the above" only when required
to measure specific learning. Only in cases where a trainee
must be able to determine things that do not apply should
"none of the above" be used.

Its most appropriate use would be with items requiring
numerical computations where the responses can be classified as
unequivocally correct or incorrect. If it is used frequently,
it must be the wrong answer some of those times. When it is
the right answer, the alternatives that do not apply must be
plausible, but must also in fact not apply.

9. Avoid the use of the alternative "all of the above." The
alternative "all of the above" creates two significant
difficulties. First, test takers may recognize the first
response as correct and mark it without reading all of the
alternatives. Second, a test taker may recognize two of the
alternatives as being correct, and not know about the third.
He will still get the item correct without complete
understanding, however, by marking "all of the above." It is
better in these cases to make the alternatives into a list,
and then ask the respondent to check which are correct:

a. 1 & 2

b. 1 & 3

c. 2 & 3

d. All of the above



Interpretive Exercises

An interpretive exercise consists of a series of objective
items based on a common set of data (written material, tables,
charts, graphs, maps or illustrations). Test items are most
commonly of the multiple-choice or alternative response type.







Since all test takers are presented with a common set of data,
it is possible to measure a variety of complex learning
outcomes. Test takers can be asked to apply principles,
interpret relationships, recognize and state inferences,
recognize relevant information, develop hypotheses, formulate
conclusions, recognize assumptions, recognize limitations,
state significant problems, and design experimental
procedures. All these are indicators of complex achievement.

The most common method of getting students to demonstrate 
these abilities has been to ask them to write an essay. The 
main advantage of the interpretive exercise over the essay- 
type question is derived from the greater structure provided 
by the interpretive exercise. Test takers cannot redefine 
the problem, or arrange their answer to demonstrate only 
those thinking skills in which they are most proficient. 
The series of objective items forces them to demonstrate the 
specific mental abilities called for. It also makes it 
possible to measure separate aspects of problem-solving 
ability and to use objective scoring procedures. 

The validity of exercises measuring intellectual skills may
be questionable in terms of a Test/Retest instrument except
in courses specifically designed for the development of such
skills. However, it is felt that objective-type exercises
can be useful in determining the trainee's ability to apply
new learning, or to reason in a subject area with which he
has become familiar during the course of instruction. In
addition, the amount of factual material given in the exercises
or asked to be provided by the pupil can be controlled:
definitions of terms, formulas for calculation, and the like, may
be either provided or withheld, thus regulating the difficulty
of the test item measuring achievement of a specific learning
outcome.



Constructing Interpretive Exercises

There are two major tasks involved: selection of appropriate
introductory material and constructing a series of dependent
items. Special care must be taken to construct test items
that require an analysis of the introductory material; items
that simply measure reading skill or rely on general
information apart from what is contained in the material are not
useful for the purpose for which the exercise has been
intended. The following suggested guidelines will aid in
constructing valid interpretive exercises.

1. Select introductory material that is in harmony with the
objectives of the course. Interpretive exercises, like other
testing procedures, should measure the achievement of specific
instructional goals. Success in this regard depends to a
large extent on the introductory material, since this provides
the common setting on which the specific test items are based.
If the introductory material is too simple, the exercise may
become a measure of general information recognition, recall
or simple reading skill. On the other hand, if the material
is too complex or unrelated to instructional goals, it may
become a measure of general reasoning ability unrelated to
specific learning outcomes. Both extremes must be avoided.
Ideally, the introductory material should be pertinent to
the course content and complex enough to call forth the
mental responses specified in the course objectives.

2. Select introductory material that is new to students.
In order to measure complex learning outcomes, the content of
the introductory material must contain some novelty. Asking
students to interpret materials identical to those used in
instruction provides no assurance that the exercise is
measuring anything other than rote memory. Too much novelty,
however, must be avoided. Materials similar to those used
during the course but which vary slightly in content or form
are most desirable. Such materials can usually be obtained by
modifying selections from textbooks, newspapers, news
magazines, and various reference materials pertinent to the
course content.

3. Select introductory material that is brief but meaningful.
One method of minimizing the influence of general reading
skill on the measurement of complex learning outcomes is to
keep the introductory material as brief as possible. Digests
of articles are frequently available and provide good raw
material for interpretive exercises. Where digests are
unavailable, the summary of an article or a key passage may
provide sufficient material. In some cases, the relevant
information is summarized more adequately in a table, diagram,
or picture.

4. Revise introductory material for clarity, conciseness, and
greater interpretive value. Although some materials (for
example, graphs) can be used without revision, most selections
require some adaptation for testing purposes. Technical
articles frequently contain long, detailed descriptions of
events. On the other hand, news reports and digests of
articles are brief but frequently present exaggerated reports
of events to attract reader interest. While such exaggerated
reports provide excellent material for measuring the ability
to judge the relevance of arguments, the need for assumptions,
the validity of conclusions, and the like, the material must
usually be modified to be used effectively.

5. Construct test items which require analysis and
interpretation of the introductory material. There are two common
errors in the construction of interpretive exercises which
invalidate them as a measure of complex achievement. One is
to include questions which are answered directly in the
introductory material; that is, asking for factual
information which is explicitly stated in the selection. Such
questions measure simple reading skill. The second is to
include questions which can be answered correctly without
reading the introductory material; that is, requiring
answers based on general information in the area. These
questions, of course, merely measure simple knowledge
outcomes.

If the interpretive exercise is to function as intended, it
should include only those test items which require pupils to
read the introductory material and to make the desired
interpretations. In some instances, the interpretations will
require pupils to supply knowledge beyond that presented in
the exercise. In others, the interpretations will be limited
to the factual information provided. The relative emphasis
on knowledge and interpretive skill will be determined by
the specific learning outcomes being measured. Regardless
of the emphasis, however, the test items should be dependent
on the introductory material, while at the same time calling
forth mental responses of a higher order than those related
to simple reading comprehension.

6. Make the number of test items roughly proportional to
the length of the introductory material. It is inefficient
to have pupils analyze a long, complex selection of material
and answer only one or two questions concerning it. Although
it is impossible to specify the exact number of questions
which should accompany a given amount of material, the items
presented as examples in this section illustrate a desirable
balance. Whenever possible, it is best to use an exercise
that has brief introductory material and a relatively large
number of test items.

7. In constructing test items for an interpretive exercise,
observe all pertinent suggestions for constructing objective
items. The form of test item used in the interpretive
exercise will determine the suggestions for construction
which have greater value. If common forms of the multiple-
choice or alternative-response item are used, the specific
suggestions for constructing these item types should be
observed. Where modified forms are used, suggestions for
constructing each of the various types of objective items
should be reviewed for their applicability. Freedom from
[words illegible] is as important in interpretive exercises
as in independent test items.



SAMPLE INTERPRETIVE EXERCISES

[The sample exercise pages are not legible in this reproduction.]

Fa 



TESTING SIGNIFICANCE OF DIFFERENCE BY
APPLICATION OF THE t-TEST FOR RELATED VARIABLES



The computational formula for computing the t-statistic is:

$t = \dfrac{M_D}{\sigma_{M_D}}$

where: $M_D$ = Mean Difference Between Test and Retest Scores

$\sigma_{M_D}$ = Standard Error of Mean Difference Between
Test and Retest Scores.



Computational Procedures

(A sample t-Test run is illustrated in Figure F1. References
to the appropriate computations presented in the figure should
be made as each successive stage is presented in the
discussion.)

1. Compute for each examinee the difference between his
Test and Retest scores. This is done in the column
labeled D. It makes no difference which score is
subtracted from which (Test from Retest or
Retest from Test) as long as the procedure is carried
out in the same way for all examinees.

Comment: The result of Step 1 is the distribution
of direct score differences from which all further
computations will derive.

2 a. Compute the algebraic Mean of the score difference
($M_D$). First, sum all positive D values and sum all
negative D values. Then subtract the sum of the
negative from the sum of the positive D values to
obtain ΣD.

Comment: In the sample run, Retest-Test differences
were computed, all of which were positive.

b. Divide the ΣD by N (number of examinees or pairs of
raw scores) to compute the $M_D$:

$M_D = \dfrac{\sum D}{N}$



FIGURE F1
t-TEST WORKSHEET

Item Set 1, 11/73-12/73

[Worksheet for 31 examinees. Columns: Examinee ID; Item Set 1
scores T1 (Test) and R1 (Retest); D; D². The individual entries
are not legible in this reproduction. Legible summary entries:
ΣD(-) = 0; ΣD = 340; DECISION: p < .01.]


3 a. Compute for each examinee the square of the Test/
Retest score difference. This is done by squaring
each of the values in the D column and recording
them in the column labeled D².

b. Compute the ΣD² by summing all D² values.

4. Compute the standard deviation of the distribution
of differences ($\sigma_D$). The computational formula is:

$\sigma_D = \sqrt{\dfrac{\sum D^2}{N} - M_D^2}$

5. Compute the standard error of the mean difference
($\sigma_{M_D}$) from $\sigma_D$ and N.

6. Compute the t-statistic from the t-ratio as follows:

$t = \dfrac{M_D}{\sigma_{M_D}}$
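The six steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the worksheet itself: the standard deviation is taken as the square root of ΣD²/N minus the squared mean, and the standard error as that standard deviation divided by the square root of N-1, which together reproduce the conventional t for related samples. The function name and the five pairs of scores are hypothetical.

```python
import math


def related_t(test_scores, retest_scores):
    """t-statistic for related (paired) scores, following Steps 1-6."""
    n = len(test_scores)
    # Step 1: Retest minus Test, in the same direction for every examinee.
    d = [r - t for t, r in zip(test_scores, retest_scores)]
    # Step 2: algebraic mean of the differences.
    mean_d = sum(d) / n
    # Step 3: sum of the squared differences.
    sum_d2 = sum(x * x for x in d)
    # Step 4 (assumed form): standard deviation of the differences.
    sd_d = math.sqrt(sum_d2 / n - mean_d ** 2)
    # Step 5 (assumed form): standard error of the mean difference.
    se_mean_d = sd_d / math.sqrt(n - 1)
    # Step 6: the t-ratio, to be read with N - 1 degrees of freedom.
    return mean_d / se_mean_d, n - 1


# Hypothetical scores for five examinees (not taken from Figure F1).
test = [10, 12, 9, 14, 11]
retest = [13, 15, 10, 18, 14]
t_value, df = related_t(test, retest)
print(round(t_value, 3), df)
```

If every examinee shows exactly the same difference, the standard deviation is zero and the ratio is undefined; the sketch does not guard against that case.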








Interpreting the t-Statistic

a. Determine the value of t required for significance
at the 5% and/or 1% level and beyond for N-1 degrees
of freedom*. These values are provided in Figure
F2.

Comment: In the sample run the t values for 30
degrees of freedom (31-1) were used.

b. If the computed t-value is equal to or greater than
the value required for significance at the 5% (or 1%)
level, there is a statistical basis for inferring that
the Retest score gain was significant and, therefore,
that the trainees are significantly more competent
with certain subject matter in the post-instruction
period than in the pre-instruction period. Further,
if proper testing controls are employed as defined
in the Manual, such significant score increases
(and therefore increases in the levels of competence)
can be related to the effects of the training
experience.

Conversely, if the derived t-value is less than the
value required for significance at a certain level
(i.e., either the 5% or 1%), then it can be
concluded that there is no evidence for significant
increases in levels of competence from initial
testing to retesting.



* Although both the 5% and 1% levels are provided, it is
understood that only one or the other level will be used
for the application of tests of significance to a specific
body of data. That level should be selected, according
to the rules for proper statistical testing, on an "a
priori" basis by the evaluator.



Sample t-Test Run 

For the data in Figure F1, the t-ratio is 10.97/1.10, giving
a t-value = 9.97. When applying the t-Test for Related
Variables, the number of degrees of freedom (df) to use
when entering the table of t-values for various significance
levels (Fig. F2) is N-1, where N is the number of examinees
for whom both Test and Retest responses have been obtained
(in this case, 31-1=30). With 30 df, the t-statistic is
significant beyond the 1% level; therefore, it can be
concluded that the trainees increased their levels of
competence with the subject matter tested by Item Set 1 to a
significant degree from the pre- to post-instruction periods.
Furthermore, since all the proper controls were employed
during the entire phase of instrument construction and
administration, there is no evidence for inferring that the
competence increases were due to factors other than the
direct effects of instruction.
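The sample run can be checked with a short Python sketch. The mean difference, the standard error, and the two critical values for 30 degrees of freedom are the figures quoted in this appendix; the function and variable names are illustrative assumptions.

```python
# Critical values of t for 30 degrees of freedom, as listed in Figure F2.
CRITICAL_T_DF30 = {"5%": 1.697, "1%": 2.457}


def is_significant(t_value, level, critical=CRITICAL_T_DF30):
    """True if the computed t equals or exceeds the tabled value at `level`."""
    return t_value >= critical[level]


mean_difference = 10.97   # mean Retest-Test difference from the sample run
standard_error = 1.10     # standard error of that mean from the sample run
t_value = mean_difference / standard_error

print(round(t_value, 2), is_significant(t_value, "1%"))
```

The significance level passed to the function should be the one chosen in advance by the evaluator, as the footnote on the 5% and 1% levels requires.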






FIGURE F2 



TABLE OF VALUES OF t AT THE 5% & 1% LEVELS OF SIGNIFICANCE

Degrees of
freedom (df)        5%          1%

      1           6.314      31.821
      2           2.920       6.965
      3           2.353       4.541
      4           2.132       3.747
      5           2.015       3.365
      6           1.943       3.143
      7           1.895       2.998
      8           1.860       2.896
      9           1.833       2.821
     10           1.812       2.764
     11           1.796       2.718
     12           1.782       2.681
     13           1.771       2.650
     14           1.761       2.624
     15           1.753       2.602
     16           1.746       2.583
     17           1.740       2.567
     18           1.734       2.552
     19           1.729       2.539
     20           1.725       2.528
     21           1.721       2.518
     22           1.717       2.508
     23           1.714       2.500
     24           1.711       2.492
     25           1.708       2.485
     26           1.706       2.479
     27           1.703       2.473
     28           1.701       2.467
     29           1.699       2.462
     30           1.697       2.457

     40           1.684       2.423
     60           1.671       2.390
    120           1.658       2.358



The numeric data in this table are adapted from Table 7 of
John T. Roscoe: Fundamental Research Statistics for the
Behavioral Sciences, published by Holt, Rinehart and Winston,
Inc., New York City, 1969, p. 293.






AN ALTERNATIVE TO THE STANDARD t-TEST*

Two types of correct answers contribute to an uncorrected
(for guessing; see p. 90) achievement test score: answers
guessed correctly and correct answers based upon true
competence with the subject matter under assessment. The
standard t-Test does not take these two components of a
total score into account. Therefore, there might be some
question as to whether a significant Test-to-Retest score
increase reflects a true increase in levels of competence
or an increase in the number of items guessed correctly.

The most appropriate significance test to employ is one
which attempts to partial out the contribution of chance
factors (guessing correctly) to total score.

The statistical test introduced here was designed
specifically for application to mean scores derived from the
administration of an objective assessment instrument under
a Test/Retest design. In contrast to the standard test,
the test variation takes into account and attempts to
partial out the contribution of chance in order to obtain
a more valid evaluation of the changes in levels of
competence.

NOTE: The test variation is offered as an alternative
to the standard t-Test (rather than recommended
outright) because it is a new procedure, its validity
yet to be established through repeated application.
Therefore, the new test cannot, at this time, be
considered as a replacement for the standard t-Test.
Nevertheless, we feel that it is a valuable new
technique, and one that is probably more appropriate than
the standard test in this situation. We would hope
that users of this Manual who are familiar with
statistical inference will employ this procedure and assess
its validity. We would appreciate hearing from those
who do apply our procedure to their own testing results.
The results of applications and outside comments on the
appropriateness and validity of the procedure for its
intended purpose would be valuable.



* The new procedure is referred to as the "z-variation" of
the standard t-Test. Although it can be applied to the
same test data and is appropriate for small sample testing
(i.e., where N ≤ 30), the z-variation is not a t-Test. The
distribution of the z-statistic is normal, unlike the
t-statistic, which has a Student's distribution.



Description of the Procedure 

The steps involved in employing the alternative test will 
differ according to whether the number of response choices 
is constant or variable' across test items. Both situations 
will be considered. 

Situation I (the number of item response alternatives is 
constant) . 

a) Definition of Variables:

X1 = number of items known (i.e., based on subject
competence) and scored right on Pre-Test

X2 = number of items known and scored right on
Post-Test

Y1 = number of items scored right on Pre-Test
(i.e., the sum of items guessed correct and
correct items based on competence)

Y2 = number of items scored right on the Post-Test

T = total number of items (i.e., No. examinees x
No. items per testing)

N = number of alternatives per item (N in this
case is a constant)

The test procedure will determine if an observed score
difference (i.e., Y2-Y1) is statistically significant.

b) The Computational Formula

The formula for computing the z-statistic from the
raw data is provided below. (The derivation of the
formula is provided on pp. 177-179.)

$z = \dfrac{\sqrt{N}\,(Y_2 - Y_1)}{\sqrt{2\left(T - \dfrac{Y_1 + Y_2}{2}\right)}}$   (1)



c) Computational Procedures

A simple worksheet for computing the z-statistic
can be constructed similar to the t-Test
worksheet illustrated in Figure F1.

The only raw data required for the computations
are the distribution of raw scores by trainees
for both Test and Retest (i.e., the data in the
columns labeled "Item Set 1 - T1/R1" in Figure F1).
All input values for the z-statistic formula are
derived from this set of data.




Sample Computations:

Y1 = 737 (Sum of scores in Col. T1)
Y2 = 1077 (Sum of scores in Col. R1)
N = 4 (Number of response alternatives per item)
T = 1581 (Total number of possible responses per
Item Set per testing: 31 examinees x 51
items in Set 1)

Substituting these values in Formula (1), we get

$z = \dfrac{\sqrt{4}\,(1077 - 737)}{\sqrt{2\left(1581 - \dfrac{1077 + 737}{2}\right)}} = \dfrac{2\,(340)}{\sqrt{2\,(1581 - 907)}} = \dfrac{680}{\sqrt{1348}} = \dfrac{680}{36.71} = 18.52$
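The substitution above can be reproduced with a short Python sketch of Formula (1). The function and argument names are illustrative assumptions; the input values are those of the sample computation.

```python
import math


def z_statistic(y1, y2, t, n):
    """Formula (1): z = sqrt(n) * (y2 - y1) / sqrt(2 * (t - (y1 + y2) / 2)).

    y1, y2: items scored right on Pre-Test and Post-Test (all examinees)
    t:      total number of items (examinees x items per testing)
    n:      number of response alternatives per item
    """
    return math.sqrt(n) * (y2 - y1) / math.sqrt(2 * (t - (y1 + y2) / 2))


examinees, items_per_test = 31, 51
z = z_statistic(y1=737, y2=1077, t=examinees * items_per_test, n=4)
print(round(z, 2))
```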



Interpreting the z-statistic

a) If the derived z-value is greater than or equal to
1.96 but less than 2.58 (i.e., 2.58 > z ≥ 1.96), the
difference is significant at the 5% level and
evidence exists for inferring a significant increase
in the overall trainee group's level of subject
matter competence.

b) If the derived z-value is greater than or equal to 2.58
(i.e., z ≥ 2.58), the difference is significant at the
1% level and evidence exists for inferring a
significant Pre- to Post-instruction increase in overall
levels of subject matter competence.

c) If significance at or beyond other levels is required
(e.g., 0.1% level), the critical values can be found
in the "Table of Cumulative Normal Probabilities" in
any standard textbook of statistical inference.
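Rules a) and b) amount to a two-threshold lookup, sketched below in Python. The function name and the wording of the returned labels are assumptions; the cut-off values 1.96 and 2.58 are the ones given above.

```python
def interpret_z(z):
    """Classify a derived z-value using the 1.96 and 2.58 cut-offs."""
    if z >= 2.58:
        return "significant at the 1% level"
    if z >= 1.96:
        return "significant at the 5% level"
    return "not significant at the 5% level"


print(interpret_z(18.52))
```

Applied to the sample value of 18.52, the sketch returns the 1% label.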



Situation II (the number of response alternatives is
variable across items)

The testing procedure will be the same as for Situation I
with the exception of one additional step. Whereas in
the first case N is given, here it will be a derived
value; i.e., the average number (harmonic mean) of
response choices per item.

In addition to the variables provided above (see a) under
Situation I), the following variables will be defined:*

a = number of items with n1 choices
b = number of items with n2 choices
c = number of items with n3 choices
t = a+b+c = total number of items in test
x = number of items correct on basis of subject
competence

y = number of items guessed correctly

w = number of items guessed wrong



* For illustration purposes the number of response choices
ranges from n1 to n3. More variables (e.g., d items with n4
choices, e items with n5 choices, etc.) can be added to the
harmonic mean formula to accommodate a wider range in the
number of response choices provided.







a) Based on the assumption that the probability of
knowing the correct answer to an item (i.e., in this
case having the subject competence to be able to
answer the item correctly) is independent of
the number of item choices,

$y = \dfrac{t - x}{t}\left(\dfrac{a}{n_1} + \dfrac{b}{n_2} + \dfrac{c}{n_3}\right)$

(the value of y in the case of the same number of
choices across items is $\dfrac{t - x}{n}$)

therefore,

$\dfrac{t - x}{n} = \dfrac{t - x}{t}\left(\dfrac{a}{n_1} + \dfrac{b}{n_2} + \dfrac{c}{n_3}\right)$

and,

$n = \dfrac{a + b + c}{\dfrac{a}{n_1} + \dfrac{b}{n_2} + \dfrac{c}{n_3}}$   (2)

b) Once n is computed using formula (2), the value
of n can be substituted for N in formula (1) and
the z-statistic computed and interpreted.
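Formula (2) is the harmonic mean of the number of choices, weighted by how many items carry each number. A minimal Python sketch follows; the function name and the 40-item test layout are hypothetical and chosen only to show the arithmetic.

```python
def harmonic_mean_choices(items_by_choices):
    """Formula (2): n = (a + b + c) / (a/n1 + b/n2 + c/n3).

    items_by_choices maps a number of response choices (n1, n2, ...)
    to the number of items offering that many choices (a, b, ...).
    """
    total_items = sum(items_by_choices.values())
    expected_share = sum(count / choices
                         for choices, count in items_by_choices.items())
    return total_items / expected_share


# Hypothetical test: 20 four-choice, 10 five-choice and 10 two-choice items.
n = harmonic_mean_choices({4: 20, 5: 10, 2: 10})
print(round(n, 3))
```

When every item offers the same number of choices, the function returns that number, which is the Situation I case.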



Derivation of the Computational Formula for the z-Statistic*



The general definitional formula for the z-statistic is:

$z = \dfrac{Y_2 - Y_1}{SE(Y_2 - Y_1)}$

where:

$Y_2 - Y_1$ = magnitude of the difference
between aggregate Test and
Retest scores.

$SE(Y_2 - Y_1)$ = standard error of difference
between Test and Retest scores.

* The variables employed in this section were previously defined
(see p. 174).

eJror^!sEf°ofYp''vi^^^ computational formula, the standard 
^ * i . ^^-^^ (i.e.i, for items guessed correctly) must 
- o\A ^'^^ "V refers to the "JarSe 

v(y2-yi) = v(y2) + v(yi) 

but, VV12) = v(X2 + NO. items guessed correctly on Posf-Tte 

= \ (1-i) (T-X2) 

- Similarly, v(yi) = i (,t-x1) 
merefore, -v(y2-ylj = 2^ a-\)A'^ - ^^^^^) 

fii!°f ''^^"^^ °^ ^""^ cannot be known, each Will be 

expressed in ferms of yl, y2 and T as follows: 



(yi+yZ - (.X1+X2)') = 2T - hn-X2) 



or, X1+X2 - f (Y1+Y2 ) - T 
2 N-1 



ther^re, v(y2-yi) = 2.l\i-i) /t - J (Yl+y2) - t 

N 2 _^ ^ 

. _ 2 • Yl+y2 , 



179 



*~the SE(y2-yi) 



' /2 .■YUY2~ 



z = 



the derived computational formula is:' 
y2.yi ^ ^ (Y2-Y1) 



/? YUY2 2 ^^ --^7 

V N " 2 ^ '^i^ ' 

C^e seonpling distribution of the z-statistic is normal 
with a mean of 0 and a standard derivation of 1.) 
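
The derived computational formula can be sketched in Python as follows. This is a minimal illustration, not the Manual's own program: the function name is hypothetical, and the example scores (60 and 78 correct out of 113 items) are borrowed from the chi square sample data, with N = 4 response choices assumed purely for the sake of the example.

```python
import math


def z_statistic(y1, y2, t, n):
    """z = (Y2 - Y1) / sqrt((2/N) * (T - (Y1 + Y2)/2)).

    y1, y2: Test and Retest scores; t: total number of items;
    n: number of response choices per item (or the value from formula (2)).
    """
    variance = (2.0 / n) * (t - (y1 + y2) / 2.0)
    if variance <= 0:
        # Happens only when both scores equal the total number of items.
        raise ValueError("variance of the difference must be positive")
    return (y2 - y1) / math.sqrt(variance)


# Assumed example: 113 items, 4 choices each, 60 correct before, 78 after.
z = z_statistic(60, 78, 113, 4)
print(round(z, 2))  # 3.84
```

The result is then referred to the standard normal distribution (mean 0, standard deviation 1) for interpretation.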






(see pp. 90-91)

Fb

TESTING SIGNIFICANCE OF DIFFERENCE BY APPLICATION
OF THE CHI SQUARE TEST OF INDEPENDENCE

In order to illustrate the computational formula for deriving
the Chi Square (χ²) Statistic, the observed frequencies in
the 2 X 2 contingency table (see text, page 91) can be
symbolized as follows:





     Testing        Correct      Incorrect      Total

     Pretest        A (60)       B (53)         A + B

     Posttest       C (78)       D (35)         C + D

     Total          A + C        B + D          N (= A+B+C+D)

Using the above scheme, the computation of the χ² Statistic
is carried out according to the formula

$$ \chi^2 = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)} $$



Sample Computation

Employing the Composite Score Data for examinee #1 (as shown
in Figure 5, p. 52, and reproduced below), the chi square
statistic is computed as follows:

                    Correct      Incorrect      Total

     Pre               60            53          113

     Post              78            35          113

     Total            138            88          226

$$ \chi^2 = \frac{226\,\big((60 \times 35) - (53 \times 78)\big)^2}{(113)(113)(138)(88)} $$

$$ = \frac{226\,(2100 - 4134)^2}{(12769)(12144)} $$

$$ = \frac{226\,(4137156)}{155066736} $$

$$ = 6.03 $$
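
The sample computation can be checked with a short Python sketch of the same 2 X 2 formula. The function name is hypothetical; the cell values are those given for examinee #1.

```python
def chi_square_2x2(a, b, c, d):
    """Chi square for a 2 x 2 table: N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)).

    a, b: Pre-Test correct and incorrect; c, d: Post-Test correct and incorrect.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be greater than zero")
    return n * (a * d - b * c) ** 2 / denominator


# Examinee #1: A = 60, B = 53, C = 78, D = 35.
print(round(chi_square_2x2(60, 53, 78, 35), 2))  # 6.03
```

Note that this is the uncorrected formula as printed; statistical packages that apply a continuity correction to 2 X 2 tables by default will return a somewhat smaller value.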



Interpreting the Computed Chi Square Value

In the sample 2 x 2 table, the observed cell frequencies are
classified two ways: by "correct vs incorrect" items, and by
time of testing (Pre-Test vs Post-Test).

In terms of testing for significant differences, the essential
question is whether or not the two ways of categorizing the
observed cell frequencies are independent of each other.

If the two ways of classifying are independent, then the
distribution of correct and incorrect item responses does
not depend upon the time of testing. This is the same as
stating that the Pre-Test scores (i.e., the distribution of
correct responses) do not differ significantly from the
Post-Test scores.






If the categorizations are not independent (i.e., if they
are correlated), then there is evidence for the fact that the
Post-Test differs significantly from the Pre-Test in the
distribution of correct and incorrect item responses.

As stated earlier, the 5% and 1% levels of significance are
used for significance testing in the analysis section. The
values of χ² required for significance (for any 2 X 2
table) are:

     3.84 for significance at (or beyond) the 5% level
     6.64 for significance at (or beyond) the 1% level

When the Null Hypothesis (that the Pre-Test and Post-Test
distributions of correct and incorrect responses are
independent of each other) is rejected at (or beyond) the 5% or
1% level of significance, the alternative hypothesis that the
Pre-Test and Post-Test distributions are correlated, and
therefore significantly different, is supported. Furthermore,
if the Null Hypothesis is rejected, and at the same time the
Post-Test score is greater than its Pre-Test counterpart,
then it can be concluded that the Post-Test score gain is
statistically significant. If the assumption that appropriate
controls (discussed in the text) were employed during
test construction and administration is accepted, then there
is evidence for inferring that the significant score gains
reflect increases in subject competence brought about as the
result of training instruction.

(For the sample data, the computed χ² value of 6.03 was
significant beyond the 5% level. Furthermore, the fact that
the Post-Test score was higher than the Pre-Test gave support
to the inference cited above for the positive impact of
instruction on increasing the levels of general or composite
subject matter competence for trainee #1.)

When, on the other hand, the computed χ² value falls short of
significance at the 5% or 1% level (whichever had been
preselected), there is no unequivocal evidence for significant
statistical differences between Test and Retest score
distributions. Inferences of positive effects of instruction
on subject matter competence for the specific examinee
under assessment would not be supported.
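
The decision rule described above can be sketched as a small Python function. The critical values 3.84 and 6.64 are those stated in the text; the function name and the wording of the returned labels are hypothetical.

```python
# Critical chi square values for any 2 x 2 table, as given in the text.
CRITICAL_VALUES = {"5%": 3.84, "1%": 6.64}


def interpret(chi_square, pre_correct, post_correct, level="5%"):
    """Classify a computed chi square at the preselected significance level."""
    critical = CRITICAL_VALUES[level]
    if chi_square < critical:
        # Falls short of significance: no inference of an effect is supported.
        return "no significant difference"
    if post_correct > pre_correct:
        # Null Hypothesis rejected and the Post-Test score is higher.
        return "significant gain"
    return "significant difference without a gain"


print(interpret(6.03, 60, 78))        # significant gain
print(interpret(6.03, 60, 78, "1%"))  # no significant difference
```

The second call shows why the level must be chosen beforehand: the same sample value of 6.03 is significant at the 5% level but falls short at the 1% level.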






(see pp. 91-98)

Fc

QUANTITATIVE PROCEDURES
FOR CONSTRUCTING ITEM PATTERN ANALYSIS TABLES

To illustrate the construction, a set of data from the Rennes
field testing will be used, consisting of the responses of the
31 trainees to the 51 items of Set 1.

Figure F3 illustrates a worksheet used for recording scores
from individual answer sheets, so that all the information is
on one form (where the number of trainees or items is too
large for inclusion on a single sheet, trainees or items can
be broken into subgroups and several sheets used, with totals
added up on a cover sheet).

The worksheet is set up so that the number of correct and
incorrect responses on the Test and Retest may be totalled both
for each trainee and for each item. In addition, the direction
of movement of each item from Test to Retest is shown.

Across the top of the sheet, the number of each item is entered.
Down the left hand column, the ID number of each trainee is
entered. To fill in the worksheet, the response data from each
trainee's answer sheet is transferred to the appropriate column.
That is, taking the Pre-Test answer sheet for trainee #01 and
moving across the worksheet, enter a C (for each correct
response) or an I (for an incorrect response) under the
appropriate item number. Then, enter the total number C and the
total number I in the appropriate boxes (to the right of the
heavy black line). The same procedure is then followed with
the Post-Test answer sheet for trainee #01. After the
responses of both testings have been entered, the third
horizontal column is used to indicate the response pattern for
each item from Test to Retest, numbered as follows:



     Test (T) -> Retest (R)

     (1)  C->C
     (2)  C->I
     (3)  I->I
     (4)  I->C



FIGURE F3

SAMPLE WORKSHEET FOR ITEM PATTERN ANALYSIS

P/MCH Program - 11/73-12/73

[Worksheet: for each trainee ID, three horizontal columns (T = Test,
R = Retest, and the response pattern number), with one vertical
column per item number (1 through 51), followed by columns for the
total correct, the total incorrect, and the frequency of each of
the four response patterns: (1) C->C, (2) C->I, (3) I->I, (4) I->C.
The bottom of the sheet carries the corresponding totals for each
item.]



The last four vertical columns are filled in according to the
frequency of occurrence of each of the four Pre-Test to
Post-Test item patterns for each trainee.

All of this information is entered for each trainee, and the
vertical columns are then totalled. Across the bottom of the
page are the response patterns for each item: the number of
times they were answered correctly on the Test and Retest, the
number of times they were correct on both or incorrect on both,
and the number of times they went from correct to incorrect
or from incorrect to correct.

Down the right-hand columns are the same patterns as they
apply to each trainee: how many correct responses, how many
times they were correct both times, incorrect both times, how
many times their responses were incorrect on the Pre-Test and
correct on the Post-Test and how many times they were correct
on the Pre-Test and incorrect on the Post-Test.

Finally, both horizontal and vertical columns are totalled in
the lower right-hand corner (below and to the right of the
heavy lines). This serves as a check against errors in
calculations or recording: the totals should be the same for
both the horizontal and vertical columns.

Note: Using the worksheet, a table can be constructed
displaying the totals only. Figure F4 is an example of a table
displaying the summary scores and pattern frequencies for each
item (on the worksheet, the table values correspond to the
totals for each horizontal column). Figure F5 is an example
of a table summarizing the scores and response pattern
frequencies for each trainee (the table values represent the
totals for each vertical column on the worksheet). The table
in Figure F5 is similar to the type of table used in the
discussion of the item response pattern analysis on pp. 91-98.
When the table has been completed and the appropriate
statistics (i.e., percentages) calculated, the analysis of the
data will be conducted as described on pp. 94-98.
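
The tallying that the worksheet performs by hand can be sketched in Python. This is only an illustration under stated assumptions: each answer sheet is represented as a string of C and I marks (one per item), the three trainees and four items shown are invented, and the function name is hypothetical. Unanswered or unscorable items are not handled.

```python
from collections import Counter


def tally_patterns(test, retest):
    """Count Test -> Retest response patterns per trainee and per item.

    test, retest: dicts mapping a trainee ID to a string of 'C'/'I' marks,
    one character per item, in item order.
    """
    n_items = len(next(iter(test.values())))
    by_trainee = {}
    by_item = [Counter() for _ in range(n_items)]
    for trainee, pre in test.items():
        post = retest[trainee]
        if len(pre) != n_items or len(post) != n_items:
            raise ValueError(f"answer sheet length mismatch for {trainee}")
        counts = Counter()
        for index, (t_mark, r_mark) in enumerate(zip(pre, post)):
            pattern = f"{t_mark}->{r_mark}"   # e.g. "I->C"
            counts[pattern] += 1
            by_item[index][pattern] += 1
        by_trainee[trainee] = counts
    return by_trainee, by_item


# Invented data: three trainees, four items.
test = {"01": "CCII", "02": "CICI", "03": "IIIC"}
retest = {"01": "CICI", "02": "CCCI", "03": "ICIC"}
by_trainee, by_item = tally_patterns(test, retest)

# Cross-check described in the text: both sets of totals must agree.
trainee_totals = sum(by_trainee.values(), Counter())
item_totals = sum(by_item, Counter())
print(trainee_totals == item_totals)  # True
```

The final comparison mirrors the check in the lower right-hand corner of the worksheet: the pattern totals summed over trainees must equal those summed over items.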



FIGURE F4

ITEM RESPONSE PATTERNS
BY INDIVIDUAL ITEM

        TOTAL CORRECT     (1)     (2) TOTAL INCORRECT     (3)     (4)
 ITEM     PRE    POST    C->C    C->I     PRE    POST    I->I    I->C   TOTAL

    1       8      11       3       5      23      20      15       8      31
    2      10      11       6       4      21      20      16       5      31
    3       7      16       6       1      24      15      14      10      31
    4      11      27      10       1      20       4       3      17      31
    5      22      27      21       1       9       4       3       6      31
    6      14      26      13       1      17       5       4      13      31
    7      26      30      25       1       5       1       0       5      31
    8      23      25      18       5       8       6       1       7      31
    9      14       9       4      10      17      22      12       5      31
   10      21      22      14       7      10       9       2       8      31
   11      21      27      19       2      10       4       2       8      31
   12      15      26      13       2      16       5       3      13      31
   13      24      29      23       1       7       2       1       6      31
   14      19      27      18       1      12       4       3       9      31
   15      22      21      18       4       9      10       6       3      31
   16      20      29      18       2      11       2       0      11      31
   17      26      22      20       6       5       9       3       2      31
   18      25      30      24       1       6       1       0       6      31
   19       4       3       1       3      27      28      25       2      31
   20       3      13       0       3      28      18      15      13      31
   21       7       1       1       6      24      30      24       0      31
   22       6      21       6       0      25      10      10      15      31
   23      20      21      16       4      11      10       6       5      31
   24      27      29      27       0       4       2       2       2      31
   25      17      24      15       2      14       7       5       9      31
   26      15      27      12       3      16       4       1      15      31
   27       7      27       6       1      24       4       3      21      31
   28      23      28      23       0       8       3       3       5      31
   29      25      28      23       2       6       3       1       5      31
   30       5      18       4       1      26      13      12      14      31
   31      17      24      13       4      14       7       3      11      31
   32       1      17       1       0      30      14      14      16      31
   33      17      31      17       0      14       0       0      14      31
   34      14      21      12       2      17      10       8       9      31
   35      23      29      23       0       8       2       2       6      31
   36      10      21       8       2      21      10       8      13      31
   37      24      30      23       1       7       1       0       7      31
   38       5      16       3       2      26      15      13      13      31
   39      10       9       4       6      21      22      16       5      31
   40      10      23       8       2      21       8       6      15      31
   41      20      30      19       1      11       1       0      11      31
   42      13      25      12       1      18       6       5      13      31
   43      11      10       6       5      20      21      16       4      31
   44      10      29      10       0      21       2       2      19      31
   45       8      25       8       0      23       6       6      17      31
   46       6      21       6       0      25      10      10      15      31
   47      18      26      17       1      13       5       4       9      31
   48      11      12       4       7      20      19      12       8      31
   49       4      14       4       0      27      17      17      10      31
   50       5       0       0       5      26      31      26       0      31
   51      13      19      10       3      18      12       9       9      31

TOTAL     737    1087     615     122     844     494     372     472    1581

Of all 1581 responses: Pre correct 46.6%, Pre incorrect 53.4%
(together 100%); Post correct 68.8%.
Of the 737 Pre correct responses: C->C 83.4%, C->I 16.6%
(together 100%).
Of the 844 Pre incorrect responses: I->I 44.1%, I->C 55.9%
(together 100%).



FIGURE F5

ITEM RESPONSE PATTERNS
BY INDIVIDUAL TRAINEE

        TOTAL CORRECT     (1)     (2) TOTAL INCORRECT     (3)     (4)   TOTAL
   ID     PRE    POST    C->C    C->I     PRE    POST    I->I    I->C   ITEMS

    1      28      34      24       4      23      17      13      10      51
    2      30      42      26       4      21       9       5      16      51
    3      13      19       6       7      38      32      25      13      51
    4      26      43      25       1      25       8       7      18      51
    5      23      36      19       4      28      15      11      17      51
    6      29      37      28       1      22      14      13       9      51
    7      33      33      28       5      18      18      13       5      51
    8      23      31      16       7      28      20      13      15      51
    9      28      32      22       6      23      19      13      10      51
   10      29      34      25       4      22      17      13       9      51
   11      25      34      21       4      26      17      13      13      51
   12      13      29      12       1      38      22      21      17      51
   13      18      35      15       3      33      16      13      20      51
   14      22      35      17       5      29      16      11      18      51
   15      24      36      20       4      27      15      11      16      51
   16      16      36      11       5      35      15      10      25      51
   17      31      41      30       1      20      10       9      11      51
   18      20      36      17       3      31      15      12      19      51
   19      19      36      16       3      32      15      12      20      51
   20      21      41      20       1      30      10       9      21      51
   21      29      37      25       4      22      14      10      12      51
   22       8      31       5       3      43      20      17      26      51
   23      27      35      22       5      24      16      11      13      51
   24      29      36      23       6      22      15       9      13      51
   25      28      38      26       2      23      13      11      12      51
   26      23      38      21       2      28      13      11      17      51
   27      27      29      15      12      24      22      10      14      51
   28      29      39      25       4      22      12       8      14      51
   29      18      38      16       2      33      13      11      22      51
   30      22      35      16       6      29      16      10      19      51
   31      26      31      23       3      25      20      17       8      51

TOTAL     737    1087     615     122     844     494     372     472    1581

Of all 1581 responses: Pre correct 46.6%, Pre incorrect 53.4%
(together 100%); Post correct 68.8%.
Of the 737 Pre correct responses: C->C 83.4%, C->I 16.6%
(together 100%).
Of the 844 Pre incorrect responses: I->I 44.1%, I->C 55.9%
(together 100%).
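
The percentages reported with the totals in Figures F4 and F5 follow directly from the eight column totals. A minimal Python sketch of that arithmetic is given here; the dictionary keys and the helper name are hypothetical labels, while the totals themselves are those printed in the figures.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


# Column totals from the TOTAL line of Figures F4 and F5.
totals = {
    "pre_correct": 737, "post_correct": 1087,
    "C->C": 615, "C->I": 122,
    "pre_incorrect": 844, "post_incorrect": 494,
    "I->I": 372, "I->C": 472,
    "responses": 1581,
}

print(percent(totals["pre_correct"], totals["responses"]))    # 46.6
print(percent(totals["post_correct"], totals["responses"]))   # 68.8
print(percent(totals["C->C"], totals["pre_correct"]))         # 83.4
print(percent(totals["I->C"], totals["pre_incorrect"]))       # 55.9
```

The pattern percentages are taken against the Pre-Test totals (737 correct, 844 incorrect), not against all 1581 responses, which is why each pair sums to 100%.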



APPENDIX G 



MATHEMATICAL EQUATION AND PARAMETER VALUES FOR GENERATING
THE WEIGHTED ACHIEVEMENT/COMPETENCE SCORE CURVES

The general equation that describes mathematically the family
of curves, one series of which is presented in Figure 15, is

$$ \frac{(R - T)\,N^2}{N^2 - T^2} = I $$

where:  T = Test Score
        R = Retest Score
        N = Total Number of Items in Test Instrument
        I = Weighted Achievement/Competence Score
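
The equation can be sketched in Python in both directions: computing I from a pair of scores, and computing the Retest score R that lies on a given I curve. The function names are hypothetical; N = 100 is used in the example because that is the test length stated for the tabulated curves.

```python
def weighted_score(test, retest, n_items):
    """I = (R - T) * N^2 / (N^2 - T^2)."""
    if not 0 <= test < n_items:
        raise ValueError("Test score must satisfy 0 <= T < N")
    return (retest - test) * n_items ** 2 / (n_items ** 2 - test ** 2)


def retest_on_curve(index, test, n_items):
    """Solve the same equation for R: R = T + I * (N^2 - T^2) / N^2."""
    return test + index * (n_items ** 2 - test ** 2) / n_items ** 2


# With N = 100: a Test score of 10 on the I = 10 curve, and the reverse lookup.
print(round(retest_on_curve(10, 10, 100), 1))  # 19.9
print(round(weighted_score(50, 80, 100), 1))   # 40.0
```

Because the factor (N^2 - T^2) / N^2 shrinks as T grows, a given raw gain R - T earns a larger weighted score when the Test score was already high.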



The values of the T, R and I parameters* used to generate the
specific curves presented in the figure are as follows:

                         R, for
     T      I=0    I=10    I=20    I=30    I=40    I=50

     0        0      10      20      30      40      50
    10       10    19.9    29.8    39.7    49.6    59.5
    20       20    29.6    39.2    48.8    58.4    68.0
    30       30    39.1    48.2    57.3    66.4    75.5
    40       40    48.4    56.8    65.2    73.6    82.0
    50       50    57.5    65.0    72.5    80.0    87.5
    60       60    66.4    72.8    79.2    85.6    92.0
    70       70    75.1    80.2    85.3    90.4    95.5
    80       80    83.6    87.2    90.8    94.4    98.0
    90       90    91.9    93.8    95.7    97.6    99.5

* N=100



                   R, for
     T     I=60    I=70    I=80    I=90

     0       60      70      80      90
    10     69.4    79.3    89.2    99.1
    20     77.6    87.2    96.8
    30     84.6    93.7
    40     90.4    98.8
    50     95.0
    60     98.4














7^' 


99.6 


















REFERENCES

Notes to the Text

1
Sherman N. Tinkelman, Planning the objective test. In
Robert L. Thorndike (Ed.), Educational measurement. (2nd ed.)
Washington, D. C.: American Council on Education, 1971. p. 51.

2
Robert L. Ebel, Measuring educational achievement. Englewood
Cliffs, N. J.: Prentice-Hall, 1963. Pp. 68-72.

3
Norman E. Gronlund, Measurement and evaluation in teaching.
(2nd ed.) New York: Macmillan, 1971. p. 130.

4
Gronlund, Measurement and evaluation in teaching, p. 224.

5
Ebel, Measuring educational achievement, p. 60.

6
Lee J. Cronbach, Educational psychology. (2nd ed.) New
York: Harcourt, Brace & World, 1963. p. 554.

7
Tinkelman, in Thorndike, Educational measurement, p. 50.

8
Guidelines have been adapted from discussions of item
construction in: Robert L. Thorndike, & Elizabeth Hagen,
Measurement and evaluation in psychology and education. (2nd ed.)
New York: John Wiley, 1961. Pp. 61-85; Richard Lindeman,
Educational measurement. Glenview, Ill.: Scott Foresman,
1967. Pp. 84-85; and Alexander G. Wesman, Writing the test
item. In Robert L. Thorndike (Ed.), Educational measurement.
(2nd ed.) Washington, D. C.: American Council on Education,
1971. Pp. 81-129.

9
Ebel, Measuring educational achievement, p. 468.

10
Leslie Kish, Some statistical problems in research design.
American Sociological Review, 1959, 24, 328-338.

11
Hanan C. Selvin, A critique of significance in survey
research. American Sociological Review, 1957, 22, 519-527.

12
Ebel, Measuring educational achievement, p. 81.

13
Paul B. Diederich, Pitfalls in the measurement of gains in
achievement. In William C. Morse & G. Max Wingo (Eds.),
Readings in educational psychology. Fairlawn, N. J.: Scott
Foresman, 1962. p. 363.

14
Lindeman, Educational measurement, p. 89.



Notes to the Appendices

1
Benjamin Bloom (Ed.), Taxonomy of educational objectives:
the classification of educational goals. Handbook 1. The
cognitive domain. New York: David McKay, 1956. Pp. 28-29.

2
Lee J. Cronbach, Essentials of psychological testing. (3rd
ed.) New York: Harper & Row, 1970. p. 24.

3
Donald T. Campbell, & Julian C. Stanley, Experimental and
quasi-experimental designs for research. Chicago: Rand
McNally, 1963. p. 7.

4
Campbell, & Stanley, Experimental and quasi-experimental
designs for research, p. 7.

5
Campbell, & Stanley, Experimental and quasi-experimental
designs for research. Pp. 7-12.

6
M. J. Apter, D. Boorer, & S. Murgatroyd, A comparison of
the effects of multiple-choice and constructed response
pre-tests in programmed instruction. Programmed Learning, 1971,
8(4), 251-256.

7
Campbell, & Stanley, Experimental and quasi-experimental
designs for research, p. 48.

8
Alexander W. Astin, & Robert J. Panos, The evaluation of
educational programs. In Robert L. Thorndike (Ed.),
Educational measurement. (2nd ed.) Washington, D. C.: American
Council on Education, 1971. p. 744.

9
Guidelines adapted from: Gronlund, Measurement and
evaluation in teaching. Pp. 183-192 & 196-216; Wesman, in
Thorndike, Educational measurement. Pp. 81-128; Thorndike, &
Hagen, Measurement and evaluation in psychology and education
(2nd ed.). Pp. 60-95; and Robert L. Thorndike, & Elizabeth
Hagen, Measurement and evaluation in psychology and education.
(3rd ed.) New York: John Wiley, 1969. Pp. 104-116.







BIBLIOGRAPHY

Cited Sources

Apter, M. J., Boorer, D., & Murgatroyd, S. A comparison of
     the effects of multiple-choice and constructed response
     pre-tests in programmed instruction. Programmed Learning,
     1971, 8(4), 251-256.

Astin, Alexander W., & Panos, Robert J. The evaluation of
     educational programs. In Robert L. Thorndike (Ed.),
     Educational measurement (2nd edition). Washington, D. C.:
     American Council on Education, 1971. Pp. 733-751.

Bloom, Benjamin (Ed.) Taxonomy of educational objectives: the
     classification of educational goals. Handbook 1. Cognitive
     domain. New York: David McKay, 1956.

Campbell, Donald T., & Stanley, Julian C. Experimental and
     quasi-experimental designs for research. Chicago: Rand
     McNally, 1963.

Cronbach, Lee J. Educational psychology (2nd edition). New
     York: Harcourt, Brace & World, 1963.

Cronbach, Lee J. Essentials of psychological testing (3rd
     edition). New York: Harper & Row, 1970.

Diederich, Paul B. Pitfalls in the measurement of gains in
     achievement. In William C. Morse & G. Max Wingo (Eds.),
     Readings in educational psychology. Fairlawn, N. J.: Scott
     Foresman, 1962. Pp. 359-364.

Ebel, Robert L. Measuring educational achievement. Englewood
     Cliffs, N. J.: Prentice-Hall, 1963.

Gronlund, Norman E. Measurement and evaluation in teaching
     (2nd edition). New York: Macmillan, 1971.

Kish, Leslie. Some statistical problems in research design.
     American Sociological Review, 1959, 24, 328-338.

Lindeman, Richard H. Educational measurement. Glenview, Ill.:
     Scott Foresman, 1967.

Roscoe, John T. Fundamental research statistics for the
     behavioral sciences. New York: Holt, Rinehart & Winston,
     1969.

Selvin, Hanan C. A critique of significance in survey research.
     American Sociological Review, 1957, 22, 519-527.

Thorndike, Robert L., & Hagen, Elizabeth. Measurement and
     evaluation in psychology and education (2nd edition). New
     York: John Wiley, 1961.

Thorndike, Robert L., & Hagen, Elizabeth. Measurement and
     evaluation in psychology and education (3rd edition). New
     York: John Wiley, 1969.

Tinkelman, Sherman N. Planning the objective test. In Robert
     L. Thorndike (Ed.), Educational measurement (2nd edition).
     Washington, D. C.: American Council on Education, 1971.
     Pp. 46-80.

Wesman, Alexander G. Writing the test item. In Robert L.
     Thorndike (Ed.), Educational measurement (2nd edition).
     Washington, D. C.: American Council on Education, 1971.
     Pp. 81-129.



Supplementary Readings

Anastasi, Anne. Psychological testing (3rd edition). New York:
     Macmillan, 1968.

Dyer, Henry S. On the assessment of academic achievement. In
     William C. Morse & G. Max Wingo (Eds.), Readings in
     educational psychology. Fairlawn, N. J.: Scott Foresman, 1962.
     Pp. 353-359.

Gerberich, J. R. Specimen objective test items: a guide to
     achievement test construction. New York: Longmans & Green,
     1956.

Guilford, J. P. Fundamental statistics in psychology and
     education (4th edition). New York: McGraw-Hill, 1965.

Karmel, Louis J. Measurement and evaluation in the schools.
     New York: Macmillan, 1970.

Pierotti, Daniel J.-A., Lecorps, Philippe, & Revson, Joanne E.
     Une Analyse par l'equipe de formation. Rennes, France:
     Ecole Nationale de la Sante Publique, Section de Sante et
     Protection de la Famille, February 1974.






Manuals for Evaluation of Family Planning & Population Programs:

#1 A Framework for the Selection of Family Planning Program Evaluation Topics, by
   Jack Reynolds*

#2 A Framework for the Design of Family Planning Program Evaluation Systems, by
   Jack Reynolds*

#3 A Method for Estimating Future Caseload of Family Planning Programs, by Jack
   Reynolds and Rukmani Ramaprasad*

#4 Operational Evaluation of Family Planning Programs Through Process Analysis,
   by Jack Reynolds*

#5 The Fertility Pattern Method: Estimation of Fertility Change by Retrospective
   Quasi-Cohort Analysis of Group-Specific Fertility Patterns, by Samuel M. Wishik
   and Donald W. Helbig

#6 A Checklist for Evaluative Overviews of Family Planning Program Activities, by
   Jack Reynolds*

#7 Couple-Years of Protection: A Measure of Family Planning Program Output, by
   Samuel M. Wishik and Kwan-Hwa Chen

#8 Evaluating Training Effectiveness and Trainee Achievement: Methodology for
   Measurement of Changes in Levels of Cognitive Competence, by Bernard G.
   Pasquariella and Samuel M. Wishik

* Also available in Spanish.



